library(tidyverse)
library(dslabs)
library(Lahman)
library(HistData)
library(broom)
library(gridExtra)

Linear regression is commonly used to quantify the relationship between two or more variables. It is also used to adjust for confounding. In this course, we cover how to implement linear regression and adjust for confounding in practice using R.

The class notes for this course series can be found in Professor Irizarry’s freely available Introduction to Data Science book.

Course overview

There are three major sections in this course: introduction to linear regression, linear models, and confounding.

1. Introduction to Linear Regression

In this section, you’ll learn the basics of linear regression through this course’s motivating example, the data-driven approach used to construct baseball teams. You’ll also learn about correlation, the correlation coefficient, stratification, and the variance explained.

2. Linear Models

In this section, you’ll learn about linear models. You’ll learn about least squares estimates, multivariate regression, and several useful features of R, such as tibbles, lm, do, and broom. You’ll learn how to apply regression to baseball to build a better offensive metric.

3. Confounding

In the final section of the course, you’ll learn about confounding and several reasons that correlation is not the same as causation, such as spurious correlation, outliers, reversing cause and effect, and confounders. You’ll also learn about Simpson’s Paradox.


1. Introduction to Linear Regression

In the Introduction to Regression section, you will learn the basics of linear regression.

After completing this section, you will be able to:

  • Understand how linear regression arises from the motivating example of building a baseball team.
  • Compute and interpret the correlation coefficient.
  • Use stratification and understand the variance explained.

This section has three parts: Baseball as a Motivating Example, Correlation, and Stratification and Variance Explained. There are comprehension checks that follow most videos.

1.1 Baseball as a Motivating Example

1.1.1 Motivating Example: Moneyball

As motivation for this course, we’ll go back to 2002 and try to build a baseball team with a limited budget. Note that in 2002, the Yankees’ payroll was almost $130 million, more than triple the Oakland A’s $40 million budget.

Statistics have been used in baseball since its beginnings. The data set we will be using, included in the Lahman library, goes back to the 19th century. For example, the batting average, a summary statistic we will describe soon, has been used to summarize a batter’s success for decades. Other statistics, such as home runs, runs batted in, and stolen bases (we’ll describe all of these soon), are reported for each player in the game summaries included in the sports section of newspapers, and players are rewarded for high numbers.

Although summary statistics were widely used in baseball, data analysis per se was not. These statistics were arbitrarily decided on without much thought as to whether they actually predicted, or were related to, helping a team win. This all changed with Bill James. In the late 1970s, this aspiring writer and baseball fan started publishing articles with more in-depth analyses of baseball data. He named the approach of using data to determine which outcomes best predict whether a team wins sabermetrics. Until Billy Beane made sabermetrics the center of his baseball operations, Bill James’ work was mostly ignored by the baseball world. Today, pretty much every team uses the approach, and it has gone beyond baseball into other sports.

In this course, to simplify the example, we’ll focus on predicting scoring runs and ignore pitching and fielding, although those are important as well. We will see how regression analysis can help develop strategies to build a competitive baseball team with a constrained budget.

The approach can be divided into two separate data analyses.
In the first, we determine which recorded player specific statistics predict runs.
In the second, we examine if players were undervalued based on what our first analysis predicts.

Question 1

What is the application of statistics and data science to baseball called?

  • Moneyball
  • Sabermetrics
  • The “Oakland A’s Approach”
  • There is no specific name for this; it’s just data science.

1.1.2 Baseball Basics

We actually don’t need to understand all the details about the game of baseball, which has over 100 rules, to see how regression will help us find undervalued players. Here, we distill the sport down to the basic knowledge one needs to effectively attack the data science challenge. Let’s get started.

The goal of a baseball game is to score more runs (they’re like points) than the other team. Each team has nine batters that bat in a predetermined order. After the ninth batter hits, we start with the first again. Each time a batter comes to bat, we call it a plate appearance (PA). At each plate appearance, the other team’s pitcher throws the ball and the batter tries to hit it. The plate appearance ends with a binary outcome: the batter either makes an out (a failure) and sits back down, or doesn’t (a success) and gets to run around the bases and potentially score a run. Each team gets nine tries, referred to as innings, to score runs, and each inning ends after three outs, after the team has failed three times.

From these rules, we see how luck is involved in the process. When you bat, you want to hit the ball hard. If you hit it hard enough, it’s a home run, the best possible outcome, as you get at least one automatic run. But sometimes, due to chance, you hit the ball very hard and a defender catches it, which makes it an out, a failure. In contrast, sometimes you hit the ball softly but it lands in just the right place, and you get a hit, which is a success. The fact that chance is involved hints at why probability models will be involved in all this.

Now, there are several ways to succeed, and understanding this distinction will be important for our analysis. When you hit the ball, you want to pass as many bases as possible. There are four bases, with the fourth one called home plate. Home plate is where you start and where you try to hit, so the bases form a cycle. If you get home, you score a run. We’re simplifying a bit, but there are five ways you can succeed, in other words, not make an out.

5 ways to succeed:
- base on balls (BB)
- single
- double (X2B)
- triple (X3B)
- home run (HR)

First one is called a base on balls. This is when the pitcher does not pitch well and you get to go to first base. A single is when you hit the ball and you get to first base. A double is when you hit the ball and you go past first base to second. Triple is when you do that but get to third. And a home run is when you hit the ball and go all the way home and score a run. If you get to a base, you still have a chance of getting home and scoring a run if the next batter hits successfully. While you are on base, you can also try to steal a base. If you run fast enough, you can try to go from first to second or from second to third without the other team tagging you. All right. Now historically, the batting average has been considered the most important offensive statistic. To define this average, we define a hit and an at bat. Singles, doubles, triples, and home runs are hits. But remember, there’s a fifth way to be successful, the base on balls. That is not a hit. An at bat is the number of times you either get a hit or make an out, bases on balls are excluded.

\[\textit{batting average} = \frac{H}{AB}\]

The batting average is simply hits divided by at bats, and it is considered the main measure of a success rate. In today’s game, this success rate ranges from about 20% to 38% across players. We refer to the batting average in thousands, so, for example, if your success rate is 25%, we say you’re batting 250.

One of Bill James’ first important insights is that the batting average ignores bases on balls, even though a base on balls is a success. A player who gets many more bases on balls than the average player might not be recognized if he does not excel in batting average. But is this player not helping produce runs? No award is given to the player with the most bases on balls. In contrast, the total number of stolen bases is considered important, and an award is given to the player with the most. But players with high totals of stolen bases also make outs, as they do not always succeed. So does a player with a high stolen base total help produce runs? Can we use data science to determine if it’s better to pay for bases on balls or stolen bases?

One of the challenges in this analysis is that it is not obvious how to determine if a player produces runs, because so much depends on his teammates. We do keep track of the number of runs scored by each player. But note that if you hit after someone who hits many home runs, you will score many runs, and those runs don’t necessarily happen if we hire this player but not his home-run-hitting teammate. However, we can examine team-level statistics. How do teams with many stolen bases compare to teams with few? How about bases on balls? We have data. Let’s examine some.
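The batting-average arithmetic above can be checked with a quick sketch (the numbers here are made up for illustration, not taken from the Lahman data):

```r
# Batting average = hits / at bats; bases on balls are excluded from at bats.
hits <- 50
at_bats <- 200
batting_average <- hits / at_bats

# By convention the average is reported "in thousands":
# a 25% success rate is "batting 250".
round(batting_average * 1000)
## [1] 250
```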

Question 1

Which of the following outcomes is not included in the batting average?

  • A home run
  • A base on balls
  • An out
  • A single

Question 2

Why do we consider team statistics as well as individual statistics?

  • The success of any individual player also depends on the strength of their team.
  • Team statistics can be easier to calculate.
  • The ultimate goal of sabermetrics is to rank teams, not players.

1.1.3 Bases on Balls or Stolen Bases

Let’s start looking at some baseball data and try to answer some of these questions using the data. First: do teams that hit more home runs score more runs? We know what the answer will be, but let’s look at the data anyway. We will examine data from 1961 to 2001. We end at 2001 because, remember, we’re back in 2002 getting ready to build a team, and we start in 1961 because that year the season was lengthened from 154 games to 162 games. The visualization of choice when exploring the relationship between two variables, like home runs and runs, is a scatterplot.

data("Teams")
ds_theme_set()
Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

The code above shows how to make that scatterplot using ggplot, with the baseball statistics coming from the Lahman library. Here’s a plot of runs per game versus home runs per game. The plot shows a very strong association: teams with more home runs tended to score more runs.

Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(SB_per_game = SB / G, R_per_game = R / G) %>%
  ggplot(aes(SB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Now, let’s examine the relationship between stolen bases and runs. Here are the runs per game plotted against stolen bases per game. Here, the relationship is not as clear.

Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB / G, R_per_game = R / G) %>%
  ggplot(aes(BB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Finally, let’s examine the relationship between bases on balls and runs. Here are runs per game versus bases on balls per game. Although the relationship is not as strong as it was for home runs, we do see a pretty strong relationship here. We know that, by definition, home runs cause runs, because when you hit a home run, at least one run scores. Now, it could be that home runs also cause bases on balls; if you understand the game, you will agree that could be the case. So it might appear that bases on balls are causing runs, when in fact it’s home runs that are causing both. This is called confounding, an important concept you will learn about. Linear regression will help us parse all this out and quantify the associations, which will then help us determine what players to recruit. Specifically, we will try to predict things like how many more runs a team will score if we increase the number of bases on balls but keep home runs fixed. Regression will help us answer this question as well.
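One way to quantify what the three scatterplots suggest is to compute the correlation of each per-game statistic with runs per game (a preview; correlation is defined formally in the next section). This is a sketch assuming the Lahman `Teams` table is available, as loaded above:

```r
library(Lahman)
library(dplyr)

# Correlation of each per-game statistic with runs per game, 1961-2001.
Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR / G,
         SB_per_game = SB / G,
         BB_per_game = BB / G,
         R_per_game  = R / G) %>%
  summarize(cor_HR = cor(HR_per_game, R_per_game),
            cor_BB = cor(BB_per_game, R_per_game),
            cor_SB = cor(SB_per_game, R_per_game))
# We expect cor_HR to be the largest and cor_SB the smallest,
# matching what the plots show.
```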

Question 1

You want to know whether teams with more at-bats per game have more runs per game. Which R code below correctly makes a scatterplot for this relationship?

Teams %>% filter(yearID %in% 1961:2001) %>%
  ggplot(aes(AB, R)) +
  geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB / G, R_per_game = R / G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB / G, R_per_game = R / G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
 geom_line()

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB / G, R_per_game = R / G) %>%
  ggplot(aes(R_per_game, AB_per_game)) +
  geom_point()

Answer: the second code block, which computes per-game rates and uses geom_point().

Question 2

What does the variable ‘SOA’ stand for in the Teams table? Hint: make sure to use the help file (?Teams)

  • sacrifice out
  • slides or attempts
  • strikeouts by pitcher
  • accumulated singles

1.2 Correlation

1.2.1 Correlation

Up to now in this series, we have focused mainly on single variables. However, in data science applications it is very common to be interested in the relationship between two or more variables. We saw this in our baseball example, in which we were interested in the relationship between, for example, bases on balls and runs. We’ll come back to that example, but we introduce the concepts of correlation and regression using a simpler one: the dataset from which regression was born, an example from genetics.

Francis Galton studied the variation and heredity of human traits. Among many other traits, Galton collected and studied height data from families to try to understand heredity. While doing this, he developed the concepts of correlation and regression, and a connection to pairs of data that follow a normal distribution. Note that, at the time these data were collected, what we know today about genetics was not yet understood. A very specific question Galton tried to answer was: how much of a son’s height can I predict with the parents’ heights? Note that this is similar to predicting runs with bases on balls. We have access to Galton’s family data through the HistData package (HistData stands for historical data).

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

We’ll create a data set with the heights of fathers and their first sons, the actual data Galton used to discover and define regression. So we have the father and son height data. Suppose we were to summarize these data. Since both distributions are well approximated by normal distributions, we can use the two averages and two standard deviations as summaries.

galton_heights %>% 
  summarize(mean(father), sd(father), mean(son), sd(son))
##   mean(father) sd(father) mean(son)  sd(son)
## 1     69.09888   2.546555  70.45475 2.557061

Here they are. The average height for fathers is 69.1 inches, with a standard deviation of 2.55 inches. Sons are a little taller, because it’s the next generation: the average height is 70.45 inches, and the standard deviation is 2.56 inches.

galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha = 0.5)

However, this summary fails to describe a very important characteristic of the data that you can see in this figure. The trend that the taller the father, the taller the son, is not described by the summary statistics of the average and the standard deviation. We will learn that the correlation coefficient is a summary of this trend.

Question 1

While studying heredity, Francis Galton developed what important statistical concept?

  • Standard deviation
  • Normal distribution
  • Correlation
  • Probability

Question 2

The correlation coefficient is a summary of what?

  • The trend between two variables
  • The dispersion of a variable
  • The central tendency of a variable
  • The distribution of a variable

1.2.2 Correlation Coefficient

The correlation coefficient is defined for a list of pairs – \((x_1, y_1),...,(x_n, y_n)\) – with the following formula: \[\rho = \frac{1}{n} \sum_{i=1}^{n} \Bigg(\frac{x_i-\mu_x}{\sigma_x}\Bigg) * \Bigg(\frac{y_i-\mu_y}{\sigma_y}\Bigg)\]

Here, \(\mu_x\) and \(\mu_y\) are the averages of x and y, respectively, and \(\sigma_x\) and \(\sigma_y\) are the standard deviations. The Greek letter rho is commonly used in statistics books to denote this correlation, because rho is the Greek letter for r, the first letter of the word regression. Soon, we will learn about the connection between correlation and regression. To understand why this equation does, in fact, summarize how two variables move together, note that the i-th entry of x is \(\big(\frac{x_i-\mu_x}{\sigma_x}\big)\) SDs away from the average: \[\bigg( \frac{x_i - \mu_x}{\sigma_x} \bigg) \]

Similarly, the \(y_i\) – which is paired with the \(x_i\)– is \(y_i\) minus \(\mu_y\) divided by \(\sigma_y\) SDs away from the average y. \[\bigg( \frac{y_i - \mu_y}{\sigma_y} \bigg) \]

The average (mean): \[\mu_x = \frac{1}{n} \sum_{i=1}^n x_i=\frac{1}{n}\big(x_1+x_2+x_3+...+x_n\big) \]

The standard deviation (sd): \[\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i-\mu_x)^2}\]

The variance: \[ \textit{variance} = \sigma_x^2 \]

If x and y are unrelated, then the product of these two quantities will be positive (when both are above their averages or both below) as often as it will be negative (when one is above and the other below), so the products will average out to about 0. The correlation is this average, and therefore unrelated variables will have a correlation of about 0. If instead the quantities vary together, we are averaging mostly positive products, because they will be either positive times positive or negative times negative, and we get a positive correlation. If they vary in opposite directions, we get a negative correlation. Another thing to know is that we can show mathematically that the correlation is always between negative 1 and 1. To see this, consider that we cannot have a higher correlation than when we compare a list to itself; that would be perfect correlation.

\[\rho = \frac{1}{n} \sum_{i=1}^{n} \Bigg(\frac{x_i-\mu_x}{\sigma_x}\Bigg)^2 = \frac{1}{\sigma_x^2} \frac{1}{n} \sum_{i=1}^{n} (x_i-\mu_x)^2 = 1\]

In this case, the correlation is given by this equation, which we can show is equal to 1. A similar argument with x and its exact opposite, negative x, proves that the correlation has to be greater than or equal to negative 1. So it is between -1 and 1.
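We can check both the definition and the bounds numerically. This is a sketch on simulated data; any paired list would do:

```r
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Compute rho "by hand" from the definition. R's cor() uses the
# sample standard deviation, so the matching denominator is n - 1.
rho_by_hand <- sum(scale(x) * scale(y)) / (length(x) - 1)
all.equal(rho_by_hand, cor(x, y))  # TRUE

# The bounds: a list compared with itself gives 1,
# and compared with its exact opposite gives -1.
cor(x, x)   #  1
cor(x, -x)  # -1
```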

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)
galton_heights %>% summarize(cor(father, son))
##   cor(father, son)
## 1        0.5007248

The correlation between father and son heights is about 0.5, which you can compute using the code above. We saw what the data look like when the correlation is 0.5.

To see what data look like for other values of \(\rho\), here are six examples of pairs with correlations ranging from -0.9 to 0.99. When the correlation is negative, the variables go in opposite directions: as x increases, y decreases. As the correlation gets closer to 1 or -1, the cloud of points gets thinner and thinner. When the correlation is 0, we just see a big circle of points.
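Examples like these can be simulated. Here is a sketch that uses `MASS::mvrnorm` (MASS ships with R) to generate bivariate normal pairs with a specified correlation:

```r
library(MASS)

set.seed(2)
rho <- 0.9
# Covariance matrix with unit variances, so off-diagonals equal rho.
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
xy <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)

cor(xy[, 1], xy[, 2])   # close to 0.9
plot(xy[, 1], xy[, 2])  # a thin oval; rho near 0 would give a round cloud
```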

Question 1

Below is a scatterplot showing the relationship between two variables, x and y.

From this figure, the correlation between x and y appears to be about:

  • -0.9
  • -0.2
  • 0.9
  • 2

1.2.3 Sample Correlation is a Random Variable

Before we continue describing regression, let’s go over a reminder about random variability. In most data science applications, we do not observe the population, but rather a sample. As with the average and standard deviation, the sample correlation is the most commonly used estimate of the population correlation. This implies that the correlation we compute and use as a summary is a random variable. As an illustration, let’s assume that the 179 pairs of fathers and sons are our entire population. A less fortunate geneticist can only afford to take a random sample of 25 pairs.

set.seed(0)
R <- sample_n(galton_heights, 25, replace = TRUE) %>%
  summarize(cor(father, son))
R
##   cor(father, son)
## 1        0.6687255

The sample correlation for this random sample can be computed using this code. Here, the variable R is the random variable.

B <- 1000
N <- 25
R <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    summarize(r = cor(father, son)) %>%
    .$r
})

data.frame(R) %>%
  ggplot(aes(R)) +
  geom_histogram(binwidth = 0.05, color ="black")

mean(R)
## [1] 0.5005559
sd(R)
## [1] 0.1472816

We can run a Monte Carlo simulation to see the distribution of this random variable. Here, we recompute R 1,000 times and plot its histogram. We see that the expected value of R is the population correlation (the mean of these Rs is 0.5) and that it has a relatively large standard error relative to its size (SD 0.147). This is something to keep in mind when interpreting correlations: the sample correlation is a random variable and can have a pretty large standard error. Also note that, because the sample correlation is an average of independent draws, the Central Limit Theorem applies.

\[ R \sim N \bigg(\rho,\sqrt{\frac{1-\rho^2}{N-2}} \bigg) \]

Therefore, for a large enough sample size N, the distribution of R is approximately normal. The expected value is the population correlation. The standard deviation is somewhat more complex to derive; the formula is shown above.

qqnorm(R); qqline(R)

In our example, N = 25 does not appear to be large enough to make the approximation a good one, as we see in this qq-plot.

Question 1

Instead of running a Monte Carlo simulation with a sample size of 25 from our 179 father-son pairs, we now run our simulation with a sample size of 50.

Would you expect the mean of our sample correlation to increase, decrease, or stay approximately the same?

  • Increase
  • Decrease
  • Stay approximately the same

Question 2

Instead of running a Monte Carlo simulation with a sample size of 25 from our 179 father-son pairs, we now run our simulation with a sample size of 50.

Would you expect the standard deviation of our sample correlation to increase, decrease, or stay approximately the same?

  • Increase
  • Decrease
  • Stay approximately the same

1.3 Stratification and Variance Explained

1.3.1 Anscombe’s Quartet/Stratification

Correlation is not always a good summary of the relationship between two variables.

A famous example used to illustrate this is the following four artificial data sets, referred to as Anscombe’s quartet. All of these pairs have a correlation of 0.82. Correlation is only meaningful in a particular context.

To help us understand when correlation is meaningful as a summary statistic, we’ll try to predict the son’s height using the father’s height. This will help motivate and define linear regression. We start by demonstrating how correlation can be useful for prediction. Suppose we are asked to guess the height of a randomly selected son. Because the distribution of son heights is approximately normal, we know that the average height of 70.5 inches is the value with the highest proportion and would be the prediction that minimizes the chance of a large error. But what if we are told that the father is 72 inches? Do we still guess 70.5 inches for the son?

The father is taller than average; specifically, he is 1.14 standard deviations taller than the average father. So should we predict that the son is also 1.14 standard deviations taller than the average son? It turns out that this would be an overestimate. To see this, we look at all the sons with fathers who are about 72 inches, by stratifying on the fathers’ heights. We call the result a conditional average, since we are computing the average son height conditioned on the father being 72 inches tall.

A challenge when using this approach in practice is that we don’t have many fathers who are exactly 72 inches. In our data set, we have only eight. If we change the number to 72.5, we would have only one father who is that height. This would result in averages with large standard errors that won’t be useful for prediction. For now, we’ll take the approach of creating strata of fathers with very similar heights; specifically, we will round the fathers’ heights to the nearest inch.
This gives us a prediction of 71.84 inches for the son of a father who is approximately 72 inches tall. This is 0.54 standard deviations taller than the average son, a smaller number than the 1.14 standard deviations by which the father was taller than the average father.

Stratification followed by box plots lets us see the distribution of each group. Not surprisingly, the centers of these groups increase with father height, and the means of each group appear to follow a linear relationship. The slope of this line appears to be about 0.5, which happens to be the correlation between father and son heights. This is not a coincidence. To see the connection, we can plot the standardized heights against each other, son versus father, with a line that has slope equal to the correlation. This line is what we call the regression line.

In a later video, we will describe Galton’s theoretical justification for using this line to estimate conditional means. Here, we define it and compute it for the data at hand. The regression line for two variables, x and y, tells us that for every standard deviation \(\sigma_x\) increase above the average \(\mu_x\) in x, y grows \(\rho\) standard deviations \(\sigma_y\) above the average \(\mu_y\).
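The conditional average described above can be computed directly. This sketch recreates the `galton_heights` data defined earlier and stratifies on fathers whose height rounds to 72 inches:

```r
library(HistData)
library(dplyr)
data("GaltonFamilies")

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

# Conditional average: mean son height given the father is about 72 inches.
conditional_avg <- galton_heights %>%
  filter(round(father) == 72) %>%
  summarize(avg = mean(son)) %>%
  .$avg
conditional_avg  # about 71.84
```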

\[\bigg( \frac{y_i-\mu_y}{\sigma_y} \bigg) = \rho \bigg( \frac{x_i-\mu_x}{\sigma_x} \bigg)\]

The formula for the regression line is therefore this one. If there is perfect correlation, we predict an increase that is the same number of SDs. If there is zero correlation, then we don’t use x at all for the prediction of y. For values between 0 and 1, the prediction is somewhere in between. If the correlation is negative, we predict a reduction instead of an increase. It is because, when the correlation is positive but lower than 1, we predict something closer to the mean that we call this regression: the son regresses to the average height. In fact, the title of Galton’s paper was “Regression Towards Mediocrity in Hereditary Stature.”

\[y=b+mx\]
\[\textit{slope}~(m)=\rho\frac{\sigma_y}{\sigma_x}\]
\[\textit{intercept}~(b)=\mu_y-m\mu_x\]

Note that if we write this in the standard form of a line, y = b + mx, where b is the intercept and m is the slope, the regression line has slope \(\rho\frac{\sigma_y}{\sigma_x}\) and intercept \(\mu_y - m\mu_x\). So if we standardize the variables so that they have average 0 and standard deviation 1, then the regression line has intercept 0 and slope equal to the correlation \(\rho\). We can look at the original father and son data and add the regression line, computing the intercept and the slope with the formulas we just derived. If we plot the data in standard units, then, as we discussed, the regression line has intercept 0 and slope \(\rho\).

We started this discussion by saying that we wanted to use conditional means to predict the heights of the sons. But then we realized that there were very few data points in each stratum. When we rounded off the heights of the fathers, we found that the conditional means appear to follow a line, and we ended up with the regression line. So the regression line gives us the prediction. An advantage of using the regression line is that we use all the data to estimate just two parameters, the slope and the intercept, which makes the estimates much more stable. The conditional means were based on fewer data points, so those estimates have large standard errors and are unstable. The regression line therefore gives a much more stable prediction. However, are we justified in using the regression line to predict? Galton gives us the answer.
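The slope and intercept formulas translate directly into code. This sketch recreates `galton_heights` and adds the regression line to the scatterplot:

```r
library(HistData)
library(tidyverse)
data("GaltonFamilies")

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)

m <- r * s_y / s_x    # slope: rho * sigma_y / sigma_x
b <- mu_y - m * mu_x  # intercept: mu_y - m * mu_x

galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = m, intercept = b)
```

Because the two standard deviations are nearly equal here, the slope m comes out close to the correlation of about 0.5.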

Question 1

Look at the figure below. The slope of the regression line in this figure is equal to what, in words?

  • Slope = (correlation coefficient of son and father heights) * (standard deviation of sons’ heights / standard deviation of fathers’ heights)

  • Slope = (correlation coefficient of son and father heights) * (standard deviation of fathers’ heights / standard deviation of sons’ heights)

  • Slope = (correlation coefficient of son and father heights) / (standard deviation of sons’ heights * standard deviation of fathers’ heights)

  • Slope = (mean height of fathers) - (correlation coefficient of son and father heights * mean height of sons).

Question 2

Why does the regression line simplify to a line with intercept zero and slope \(\rho\) when we standardize our x and y variables? Try the simplification on your own first!

  • When we standardize variables, both x and y will have a mean of one and a standard deviation of zero. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: \(y_i~=~\rho*x_i\)

  • When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: \(y_i~=~\rho*x_i\)

  • When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: \(y_i~=~\rho+x_i\)

Question 3

What is a limitation of calculating conditional means?
Select ALL that apply.

  • Each stratum we condition on (e.g., a specific father’s height) may not have many data points.

  • Because there are limited data points for each stratum, our average values have large standard errors.

  • Conditional means are less stable than a regression line.

  • Conditional means are a useful theoretical tool but cannot be calculated.

1.3.2 Bivariate Normal Distribution

Correlation and the regression line are widely used summary statistics, but they are often misused or misinterpreted. Anscombe’s quartet provides examples of data sets in which summarizing with a correlation would be a mistake, and we see such misuse in the media and in the scientific literature as well. The main way we motivate the use of correlation involves what is called the bivariate normal distribution.

When a pair of random variables is approximated by a bivariate normal distribution, the scatterplot looks like an oval, like an American football. The oval can be thin (high correlation) all the way up to a circle shape (no correlation). We saw some examples previously; here they are again.

A more technical way to define the bivariate normal distribution is the following. The distribution is defined for pairs, so we have two variables, x and y, with paired values. They are bivariate normally distributed if the following holds: x is a normally distributed random variable, y is also a normally distributed random variable, and for any grouping of x that we can define, say with x equal to some predetermined value (which we call little x in this formula), the y’s in that group are approximately normal as well. When we fix x in this way, we refer to the resulting distribution of the y’s in the group, defined by setting x in this way, as the conditional distribution of y given x. \[f_{Y|X=x}~is~the~conditional~distribution\\and~E(Y|X=x)~is~the~conditional~expected~value\]

We write the notation like this for the conditional distribution and the conditional expectation. If we think the height data is well-approximated by the bivariate normal distribution, then we should see the normal approximation hold for each grouping. Here, we stratify the son height by the standardized father heights and see that the assumption appears to hold.

data("GaltonFamilies")

# Prepare the data set with the heights of fathers and the first son
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

# bivariate normal distribution
galton_heights %>%
  mutate(z_father = round((father - mean(father)) / sd(father))) %>%
  filter(z_father %in% -2:2) %>%
  ggplot() +
  stat_qq(aes(sample = son)) +
  facet_wrap(~z_father)

Here’s the code that gives us the desired plot.

\[E(Y|X=x)~=~\mu_y~+~\rho~*~\frac{x~-~\mu_x}{\sigma_x}~*~\sigma_y\]

Now, we come back to defining correlation. Galton showed – using mathematical statistics – that when two variables follow a bivariate normal distribution, then for any given x, the expected value of the y in pairs for which x is set at that value is mu_y plus rho times (x minus mu_x) divided by sigma_x, times sigma_y.

\[\frac{E(Y|X=x)~-~\mu_y}{\sigma_y}~=~\rho~*~\frac{x~-~\mu_x}{\sigma_x}\]

Note that this is a line with slope rho times sigma_y divided by sigma_x and intercept mu_y minus rho times (sigma_y divided by sigma_x) times mu_x. Standardizing both sides, it can be written as above. So in summary, if our data is approximately bivariate normal, then the conditional expectation – which is the best prediction for y given that we know the value of x – is given by the regression line.
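Reading off the terms, the slope and intercept of this regression line (the same m and b computed in code later in this section) are:

\[slope~=~\rho~\frac{\sigma_y}{\sigma_x},~~~~intercept~=~\mu_y~-~slope~\cdot~\mu_x\]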

Question 1

A regression line is the best prediction of Y given we know the value of X when:

  • X and Y follow a bivariate normal distribution.

  • Both X and Y are normally distributed.

  • Both X and Y have been standardized.

  • There are at least 25 X-Y pairs.

1.3.3 Variance Explained

The theory we’ve been describing also tells us that the standard deviation of the conditional distribution that we described in a previous video is:

\[SD(Y|X=x)~=~\sigma_y~*~\sqrt{1~-~\rho^2} \]

This is where statements like "x explains such and such percent of the variation in y" come from. Note that the variance of y is sigma_y squared; that's where we start. If we condition on x, then the variance goes down to (1 minus rho squared) times sigma_y squared. From there, we can compute how much the variance has gone down: it has gone down by rho squared times 100 percent. So the correlation and the amount of variance explained are related to each other. But it is important to remember that the "variance explained" statement only makes sense when the data is approximated by a bivariate normal distribution.

Question 1

We previously calculated that the correlation coefficient \(\rho\) between fathers’ and sons’ heights is 0.5.

Given this, what percent of the variation in sons’ heights is explained by fathers’ heights?

  • 0%

  • 25%

  • 50%

  • 75%

When two variables follow a bivariate normal distribution, the variation explained can be calculated as: \[\rho^2~*~100\]
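The arithmetic behind this answer can be sketched in a couple of lines:

```r
rho <- 0.5

# Fraction of the variance remaining after conditioning on x
remaining <- 1 - rho^2   # 0.75

# Percent of the variance in y "explained" by x
explained <- rho^2 * 100 # 25
```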

1.3.4 There are Two Regression Lines

We computed a regression line to predict the son’s height from the father’s height. We used these calculations to get the slope and the intercept.

data("GaltonFamilies")

# Prepare the data set with the heights of fathers and the first son
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

# Calculate the means, standard deviations, correlation coefficient, intercept, and the slope
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * (s_y / s_x)
b <- mu_y - (m * mu_x)

\[E(Y|X=x)~=~b+mx~=~35.7 + 0.5x\]

This gives us the function that the conditional expectation of y given x is 35.7 plus 0.5 times x. So, what if we wanted to predict the father’s height based on the son’s?

\[x~\ne~\frac{\{E(Y|X=x)~-~b\}}{0.5}\]

It is important to know that this is not determined by computing the inverse function of what we just saw, which would be this equation here. We need to compute the expected value of x given y.

\[E(X|Y=y)~=~b+my~=~34.0 + 0.5y\]

This gives us another regression function altogether, with slope and intercept computed like this. So now we get that the expected value of x given y, or the expected value of the father’s height given the son’s height, is equal to 34 plus 0.5 y, a different regression line.

So in summary, it’s important to remember that the regression line comes from computing expectations, and these give you two different lines, depending on if you compute the expectation of y given x or x given y.

data("GaltonFamilies")

# Prepare the data set with the heights of fathers and the first son
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

# Calculate the means, standard deviations, and correlation coefficient
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)

# Slope and intercept for predicting the son's height from the father's
m_y <- r * (s_y / s_x)
b_y <- mu_y - (m_y * mu_x)

# Slope and intercept for predicting the father's height from the son's
m_x <- r * (s_x / s_y)
b_x <- mu_x - (m_x * mu_y)

# Plot both lines on the (father, son) axes; the second line,
# father = b_x + m_x * son, must be inverted before plotting:
# son = (father - b_x) / m_x
galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = 'blue', intercept = b_y, slope = m_y) +            # E(son | father)
  geom_abline(color = 'green', intercept = -b_x / m_x, slope = 1 / m_x)  # E(father | son)

2. Linear Models

In the Linear Models section, you will learn how to do linear regression.

After completing this section, you will be able to:

This section has four parts: Introduction to Linear Models, Least Squares Estimates, Tibbles, do, and broom, and Regression and Baseball. There are comprehension checks that follow most videos.

2.1 Introduction to Linear Models

2.1.1 Confounding: Are BBs More Predictive?

In a previous video, we found that the slope of the regression line for predicting runs from bases on balls was 0.735.

So, does this mean that if we hire low-salary players with many bases on balls, thereby increasing the number of walks per game by 2 for our team, our team will score 1.47 more runs per game? We are again reminded that ASSOCIATION IS NOT CAUSATION. The data do provide strong evidence that a team with 2 more bases on balls per game than the average team scores 1.47 more runs per game, but this does not mean that bases on balls are the cause. If we compute the regression line slope for singles, we get 0.449, a lower value. Note that a single gets you to first base just like a base on balls does. Those who know a little more about baseball will tell you that with a single, runners on base have a better chance of scoring than with a base on balls. So how can bases on balls be more predictive of runs? The reason this happens is confounding.

data("Teams")

Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(Singles = ( H - HR - X2B - X3B) / G, BB = BB / G, HR = HR / G) %>%
  summarize(cor(BB, HR), cor(Singles, HR), cor(BB, Singles))
##   cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1   0.4039125       -0.1738404      -0.05605071

Note the correlations between home runs, bases on balls, and singles. We see that the correlation between bases on balls and home runs is quite high compared to the other two pairs. It turns out that pitchers, afraid of home runs, will sometimes avoid throwing strikes to home run hitters. As a result, home run hitters tend to have more bases on balls. Thus, a team with many home runs will also have more bases on balls than average, and as a result it may appear that bases on balls cause runs, when it is actually the home runs that cause the runs. In this case, we say that bases on balls are confounded with home runs. But could it be that bases on balls still help? To find out, we somehow have to adjust for the home run effect. Regression can help with this.

Question 1

Why is the number of home runs considered a confounder of the relationship between bases on balls and runs per game?

  • Home runs is not a confounder of this relationship.

  • Home runs are the primary cause of runs per game.

  • The correlation between home runs and runs per game is stronger than the correlation between bases on balls and runs per game.

  • Players who get more bases on balls also tend to have more home runs; in addition, home runs increase the points per game.

2.1.2 Stratification and Multivariate Regression

To try to determine whether bases on balls are still useful for creating runs, a first approach is to keep home runs fixed at a certain value and then examine the relationship between runs and bases on balls. As we did when we stratified fathers by rounding to the closest inch, here we can stratify home runs per game to the nearest tenth.

data("Teams")

dat <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_strata = round(HR / G, 1),
         BB_per_game = BB / G,
         R_per_game = R / G) %>%
  filter(HR_strata >= 0.4 & HR_strata <= 1.2)

dat %>%
  ggplot(aes(BB_per_game, R_per_game)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~HR_strata)

We filtered out strata with few points; the code above generates an informative data set. Then we can make a scatterplot of runs versus bases on balls for each stratum. This is what it looks like. Remember that the regression slope for predicting runs with bases on balls when we ignored home runs was 0.735. But once we stratify by home runs, these slopes are substantially reduced.

dat %>%
  group_by(HR_strata) %>%
  summarize(slope = cor(BB_per_game, R_per_game) * sd(R_per_game) / sd(BB_per_game))
## # A tibble: 9 x 2
##   HR_strata slope
##       <dbl> <dbl>
## 1       0.4 0.734
## 2       0.5 0.566
## 3       0.6 0.412
## 4       0.7 0.285
## 5       0.8 0.365
## 6       0.9 0.261
## 7       1   0.511
## 8       1.1 0.454
## 9       1.2 0.440

We can actually see what the slopes are by using this code: we stratify by home runs and then compute the slope using the formula shown previously. These values are closer to the slope we obtained for singles, 0.449, which is more consistent with our intuition: since both singles and bases on balls get us to first base, they should have about the same predictive power. Now, although our understanding of the application – our understanding of baseball – tells us that home runs cause bases on balls and not the other way around, we can still check whether, after stratifying by bases on balls, we still see a home run effect or whether it goes down.

dat_1 <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB_strata = round(BB / G, 1),
         HR_per_game = HR / G,
         R_per_game = R / G) %>%
  filter(BB_strata >= 2.8 & BB_strata <= 3.9)

dat_1 %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~BB_strata)

dat_1 %>%
  group_by(BB_strata) %>%
  summarize(slope = cor(HR_per_game, R_per_game) * sd(R_per_game) / sd(HR_per_game))
## # A tibble: 12 x 2
##    BB_strata slope
##        <dbl> <dbl>
##  1       2.8  1.53
##  2       2.9  1.57
##  3       3    1.52
##  4       3.1  1.49
##  5       3.2  1.58
##  6       3.3  1.56
##  7       3.4  1.48
##  8       3.5  1.63
##  9       3.6  1.83
## 10       3.7  1.45
## 11       3.8  1.70
## 12       3.9  1.30

We use the same code that we just used for bases on balls, but now we swap home runs and bases on balls to get this plot. In this case, the slopes are all around 1.5 to 1.8, so they do not change that much from the original slope estimate, which was 1.84. Regardless, it seems that if we stratify by home runs, we have an approximately bivariate normal distribution for runs versus bases on balls, and similarly, if we stratify by bases on balls, we have an approximately bivariate normal distribution for runs versus home runs. So what do we do? It is somewhat complex to compute regression lines for each stratum.

\[E[R~|~BB=x_1,~HR=x_2]~=~\beta_0~+~\beta_1(x_2)x_1~+~\beta_2(x_1)x_2 \]

We’re essentially fitting the model in this equation, with the slope for x1 changing for different values of x2 and vice versa. Here, x1 is bases on balls and x2 is home runs. Is there an easier approach? Note that if we take random variability into account, the estimated slopes by strata don’t appear to change that much. If these slopes are in fact the same, this implies that the function beta_1(x2) and the function beta_2(x1) are actually constants.

\[E[R~|~BB=x_1,~HR=x_2]~=~\beta_0~+~\beta_1x_1~+~\beta_2x_2 \]

This, in turn, implies that the expectation of runs conditioned on home runs and bases on balls can be written with this simpler model. This model implies that if the number of home runs is fixed, we observe a linear relationship between runs and bases on balls, and that the slope of that relationship does not depend on the number of home runs; only the intercept changes as the home runs increase. The same is true if we swap home runs and bases on balls. In this analysis, referred to as multivariate regression, we say that the bases on balls slope beta_1 is adjusted for the home run effect. If this model is correct, then confounding has been accounted for. But how do we estimate beta_1 and beta_2 from the data? For this, we’ll learn about linear models and least squares estimates.

Question 1

As described in the video, when we stratified our regression lines for runs per game vs. bases on balls by the number of home runs, what happened?

  • The slope of runs per game vs. bases on balls within each stratum was reduced because we removed confounding by home runs.

  • The slope of runs per game vs. bases on balls within each stratum was reduced because there were fewer data points.

  • The slope of runs per game vs. bases on balls within each stratum increased after we removed confounding by home runs.

  • The slope of runs per game vs. bases on balls within each stratum stayed about the same as the original slope.

2.1.3 Linear Models

Since Galton’s original development, regression has become one of the most widely used tools in data science. One reason for this is that regression permits us to find relationships between two variables while adjusting for others, as we have just shown for bases on balls and home runs. This has been particularly popular in fields where randomized experiments are hard to run, such as economics and epidemiology. When we are not able to randomly assign each individual to a treatment or control group, confounding is particularly prevalent. For example, consider estimating the effect of fast food on life expectancy using data collected from a random sample of people in some jurisdiction. Fast food consumers are more likely to be smokers and drinkers and to have lower incomes. Therefore, a naive regression model may lead to an overestimate of the negative health effect of fast food. So how do we adjust for confounding in practice? We can use regression. We have described how, if data are bivariate normal, then the conditional expectation follows a regression line; the conditional expectation being a line is not an extra assumption, but rather a result derived from the assumption that the data are approximately bivariate normal. However, in practice it is common to explicitly write down a model that describes the relationship between two or more variables using what is called a linear model. Note that linear here does not refer to lines exclusively, but rather to the fact that the conditional expectation is a linear combination of known quantities: any combination that multiplies each of them by a constant and then adds them up, with, perhaps, a shift.

\[ 2~+~3x~-~4y~+~5z\]

For example, 2 plus 3x minus 4y plus 5z is a linear combination of x, y, and z.

\[\beta_0~+~\beta_1x_1~+~\beta_2x_2\]

So beta_0 plus beta_1 x_1 plus beta_2 x_2 is a linear combination of x1 and x2. The simplest linear model is a constant, beta_0; the second simplest is a line, beta_0 plus beta_1 x. For Galton’s data, we would denote the observed fathers’ heights with x1 through xn.

\[Y_i~=~\beta_0~+~\beta_1x_i~+~\epsilon_i,~i~=~1,\dots,N\]

Then we model the n son heights we are trying to predict with the following model. Here, the little xi’s are the fathers’ heights, which are fixed, not random, due to the conditioning: we have conditioned on these values. Yi is the random son’s height that we want to predict. We further assume that the errors, denoted with the Greek letter epsilon, epsilon_i, are independent of each other, have expected value 0, and that their standard deviation, usually called sigma, does not depend on i – it is the same for every individual. We know the xi, but to have a useful model for prediction, we need beta_0 and beta_1. We estimate these from the data. Once we do, we can predict the son’s height for any father’s height x. Note that if we further assume that the epsilons are normally distributed, then this model is exactly the same one we derived earlier for the bivariate normal distribution. A somewhat nuanced difference is that in the first approach, we assumed the data were bivariate normal, and the linear model was derived, not assumed. In practice, linear models are often just assumed without necessarily assuming normality; the distribution of the epsilons is not specified. Nevertheless, if your data are bivariate normal, the linear model we just showed holds. If your data are not bivariate normal, then you will need other ways of justifying the model. One reason linear models are popular is that they are interpretable. In the case of Galton’s data, we can interpret the model like this: due to inherited genes, the son’s predicted height grows by beta_1 for each inch we increase the father’s height x. Because not all sons with fathers of height x are of equal height, we need the term epsilon, which accounts for the remaining variability. This remaining variability includes the mother’s genetic effect, environmental factors, and other biological randomness.
Note that, given how we wrote the model, the intercept beta_0 is not very interpretable, as it is the predicted height of a son whose father has no height. Due to regression to the mean, the prediction will usually be a bit larger than 0, which is really not very interpretable. To make the intercept parameter more interpretable, we can rewrite the model slightly in the following way.

\[Y_i~=~\beta_0~+~\beta_1(x_i~-~\bar{x})~+~\epsilon_i,~i~=~1,\dots,N\]

Here, we have changed xi to xi minus the average height x-bar; we have centered our covariate xi. In this case, beta_0, the intercept, is the predicted height for the son of an average-height father, the case where xi equals x-bar.
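As a minimal sketch of the centered model (using the lm function, which is introduced formally in the next section, and the galton_heights data frame prepared as before):

```r
library(tidyverse)
library(HistData)
data("GaltonFamilies")

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

# Fit the model with the centered covariate x_i - x_bar;
# the intercept is now the predicted height for the son of an average-height father
fit_centered <- lm(son ~ I(father - mean(father)), data = galton_heights)
fit_centered$coef
```

Note that with the centered covariate, the estimated intercept equals the average son height, mean(galton_heights$son), rather than the hard-to-interpret extrapolation to a father of height 0.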

2.2 Least Squares Estimates (LSE)

2.2.1 Least Squares Estimates (LSE)

For linear models to be useful, we have to estimate the unknown parameters, the betas. The standard approach in science is to find the values that minimize the distance of the fitted model to the data. To quantify this, we use the least squares equation. For Galton’s data, we would write something like this.

\[RSS~=~\sum_{i=1}^{n}\{Y_i-(\beta_0+\beta_1x_i)\}^2\]

This quantity is called the Residual Sum of Squares, RSS. Once we find the values that minimize the RSS, we call the values the Least Squares Estimates, LSE, and denote them, in this case, with \(\hat\beta_0\) and \(\hat\beta_1\). Let’s write the function that computes the RSS for any pair of values, beta 0 and beta 1, for our heights data.

data("GaltonFamilies")

# Prepare the data set with the heights of fathers and the first son
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

# RSS function: residual sum of squares for a pair (beta0, beta1)
rss <- function(beta0, beta1, data = galton_heights){
  resid <- data$son - (beta0 + beta1 * data$father)
  return(sum(resid^2))
}

It would look like this. For any pair of values, we get an RSS. This is a three-dimensional plot, with beta_0 and beta_1 as x and y and the RSS as z. To find the minimum, you would have to look at this three-dimensional plot. Here, we just make a two-dimensional version by keeping beta_0 fixed at 25, so we plot the RSS as a function of beta_1.

# Linear model - least squares estimates: RSS as a function of beta1, with beta0 fixed at 25
beta1 <- seq(0, 1, len = nrow(galton_heights))
results <- data.frame(beta1 = beta1,
                      rss = sapply(beta1, rss, beta0 = 25))
results %>% ggplot(aes(beta1, rss)) + geom_line(col = 2)

We can use this code to produce the plot, which shows a clear minimum for beta_1 at around 0.65. So you can see how we would pick the least squares estimates. However, this minimum is for beta_1 when beta_0 is fixed at 25, and we don’t know whether that is the minimizing value of beta_0; we don’t know whether the pair (25, 0.65) minimizes the equation across all pairs. We could use trial and error, but that is really not going to work here. Instead, we will use calculus: we take the partial derivatives, set them equal to 0, and solve for beta_0 and beta_1. Of course, if we have many parameters, these equations can get rather complex. But there are functions in R that do these calculations for us; we will learn about them soon. To learn the mathematics behind this, you can consult a book on linear models.
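For the simple one-covariate case, setting the partial derivatives to zero yields closed-form solutions. A minimal sketch computing them directly, and checking that the slope matches the correlation-based formula from Section 1:

```r
library(tidyverse)
library(HistData)
data("GaltonFamilies")

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

x <- galton_heights$father
y <- galton_heights$son

# Closed-form least squares estimates from the calculus solution
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# The slope agrees with r * s_y / s_x, the regression line slope from before
all.equal(beta1_hat, cor(x, y) * sd(y) / sd(x))
```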

2.2.2 The lm Function

In R, we can obtain the least squares estimates using the lm function.

\[Y_i~=~\beta_0~+~\beta_1x_i~+~\epsilon_i\]

To fit this model, where Yi is the son’s height and xi is the father’s height, we would write the following piece of code.

# LSE
fit <- lm(son ~ father, data = galton_heights)
fit
## 
## Call:
## lm(formula = son ~ father, data = galton_heights)
## 
## Coefficients:
## (Intercept)       father  
##     35.7125       0.5028

This gives us the least squares estimates, which we can see in the output of R. The general way we use lm is with the tilde character: the value we are predicting goes on the left side of the tilde, and the variables we use to predict go on the right side. The intercept is added automatically to the model, so you don’t have to include it when you write the formula. The object fit that we just computed includes more information about the least squares fit. We can use the function summary to extract more of this information. To understand some of the information included in this summary, we need to remember that the LSE are random variables. Mathematical statistics gives us some idea of the distribution of these random variables, and we’ll learn some of that next.
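As a sketch of extracting that extra information, assuming the fit computed as above:

```r
library(tidyverse)
library(HistData)
data("GaltonFamilies")

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

fit <- lm(son ~ father, data = galton_heights)

# summary() reports standard errors, t-statistics, and p-values
summary(fit)

# The coefficient table alone can be extracted as a matrix
summary(fit)$coef
```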

#Question 1
data("Teams")

data_teams <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(R_per_game = R / G, BB_per_game = BB / G, HR_per_game = HR / G)

fit <-lm(R_per_game ~ BB_per_game + HR_per_game, data = data_teams)
fit
## 
## Call:
## lm(formula = R_per_game ~ BB_per_game + HR_per_game, data = data_teams)
## 
## Coefficients:
## (Intercept)  BB_per_game  HR_per_game  
##      1.7444       0.3874       1.5611

2.2.3 LSE are Random Variables

The LSE are derived from the data, Y1 through Yn, which are random. This implies that our estimates are random variables. To see this, we can run a Monte Carlo simulation in which we assume that the son and father height data that we have defines an entire population. And we’re going to take random samples of size 50 and compute the regression slope coefficient for each one.

data("GaltonFamilies")

# Prepare the data set with the heights of fathers and the first son
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

# Monte-Carlo simulation and plotting them
B <- 1000
N <- 50
lse <- replicate(B,{
  sample_n(galton_heights, N, replace = TRUE) %>%
    lm(son~father, data = .) %>% .$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])

p1 <- lse %>%
  ggplot(aes(beta_0)) +
  geom_histogram(binwidth = 5, color = "black")

p2 <- lse %>%
  ggplot(aes(beta_1)) +
  geom_histogram(binwidth = 0.1, color = "black")

grid.arrange(p1, p2, ncol = 2)

We write this code, which gives us several estimates of the regression slope. We can see the variability of the estimates by plotting their distributions: here are the histograms of the estimated beta_0’s and the estimated beta_1’s. These look normal because the central limit theorem applies here as well. For large enough N, the least squares estimates will be approximately normal, with expected values beta_0 and beta_1, respectively. The standard errors are a bit complicated to compute, but mathematical theory does allow us to compute them, and they are included in the summary provided by the lm function.

# Standard error for one simulated data set
sample_n(galton_heights, N, replace = TRUE) %>%
  lm(son~father, data = .) %>% summary
## 
## Call:
## lm(formula = son ~ father, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0535 -1.5535  0.0602  1.2818  4.2374 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.8737    10.4732   2.757 0.008227 ** 
## father        0.5909     0.1521   3.884 0.000314 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.204 on 48 degrees of freedom
## Multiple R-squared:  0.2392, Adjusted R-squared:  0.2233 
## F-statistic: 15.09 on 1 and 48 DF,  p-value: 0.0003136

Here are the estimated standard errors for one of our simulated data sets. You can see them in the second column of the coefficients table.

# Estimated standard errors for the lse-data set(see above)
lse %>% summarize(se_0 = sd(beta_0), se_1 = sd(beta_1))
##       se_0      se_1
## 1 8.980321 0.1293344

You can see that the standard error estimates reported by the summary function are close to the standard errors that we obtain from our Monte Carlo simulation. The summary function also reports t-statistics (the t value column) and p-values (the Pr(>|t|) column). The t-statistic is not actually based on the central limit theorem, but rather on the assumption that the epsilons follow a normal distribution. Under this assumption, mathematical theory tells us that the LSE divided by their standard error follows a t-distribution with N minus p degrees of freedom, with p the number of parameters in our model, which in this case is 2. The two p-values test the null hypotheses that beta_0 is 0 and that beta_1 is 0, respectively. Note that, as we described previously, for large enough N the central limit theorem works and the t-distribution becomes almost the same as a normal distribution. So, whether you assume the errors are normal and use the t-distribution, or assume that N is large enough to use the central limit theorem, you can construct confidence intervals for your parameters. Although we will not show examples in this video, hypothesis testing with regression models is very commonly used in, for example, epidemiology and economics, to make statements such as "the effect of A on B was statistically significant after adjusting for X, Y, and Z." But it is very important to note that several assumptions – we just described some of them – have to hold for these statements to be valid.
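As a sketch of the confidence intervals just mentioned, assuming the galton_heights data frame from before, we can either combine the estimates and standard errors by hand or use the built-in confint function:

```r
library(tidyverse)
library(HistData)
data("GaltonFamilies")

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

fit <- lm(son ~ father, data = galton_heights)

# Approximate 95% confidence intervals: estimate +/- 2 standard errors
coefs <- summary(fit)$coef
cbind(lower = coefs[, "Estimate"] - 2 * coefs[, "Std. Error"],
      upper = coefs[, "Estimate"] + 2 * coefs[, "Std. Error"])

# Exact t-based intervals with the built-in function
confint(fit)
```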

2.3 Tibbles, do, and broom

2.3.1 Advanced dplyr: Tibbles

Let’s go back to baseball. In a previous example, we estimated the regression lines to predict runs from bases on balls in different home run strata.

data("Teams")

dat <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = round(HR / G, 1),
         BB_per_game = BB  / G,
         R_per_game  = R / G) %>%
  select(HR_per_game, BB_per_game, R_per_game) %>%
  filter(HR_per_game >= 0.4 & HR_per_game <= 1.2)

We first constructed a data frame similar to this.

dat %>% group_by(HR_per_game) %>%
  summarize(slope = cor(BB_per_game, R_per_game) * sd(R_per_game) / sd(BB_per_game))
## # A tibble: 9 x 2
##   HR_per_game slope
##         <dbl> <dbl>
## 1         0.4 0.734
## 2         0.5 0.566
## 3         0.6 0.412
## 4         0.7 0.285
## 5         0.8 0.365
## 6         0.9 0.261
## 7         1   0.511
## 8         1.1 0.454
## 9         1.2 0.440

Then, to compute the regression line in each stratum, since we didn’t know about the lm function back then, we used the formula directly, like this. We argued that the slopes are similar and that the differences were perhaps due to random variation. To provide a more rigorous defense of the slopes being the same, which is what led to our multivariate regression model, we could compute confidence intervals for each slope. We have not learned the formula for these, but the lm function provides enough information to construct them.

# Estimated slope with the function lm() - not correct:
# the lm() function ignores the group_by() function
dat %>% group_by(HR_per_game) %>%
  lm(R_per_game ~ BB_per_game, data = .) %>%
  .$coef
## (Intercept) BB_per_game 
##   2.1986073   0.6377959

First, note that if we try to use the lm function to get the estimated slope like this, we don’t get what we want: the lm function ignored the group_by. This is expected, because lm is not part of the tidyverse and does not know how to handle the outcome of group_by, which is a grouped tibble. We are going to describe tibbles in some detail now. When summarize receives the output of group_by, it somehow knows which rows of the table go with which groups. But where is this information stored in the data frame?
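Before we learn the tidyverse tools for this, note that one base-R workaround is to split the data ourselves and fit lm within each stratum – a sketch, using the dat data frame constructed above:

```r
library(tidyverse)
library(Lahman)
data("Teams")

dat <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = round(HR / G, 1),
         BB_per_game = BB / G,
         R_per_game  = R / G) %>%
  filter(HR_per_game >= 0.4 & HR_per_game <= 1.2)

# Fit lm separately within each home run stratum and keep the slopes
slopes <- sapply(split(dat, dat$HR_per_game), function(d) {
  lm(R_per_game ~ BB_per_game, data = d)$coef["BB_per_game"]
})
slopes
```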

# Output: Tibbles
dat %>% group_by(HR_per_game) %>% head()
## # A tibble: 6 x 3
## # Groups:   HR_per_game [5]
##   HR_per_game BB_per_game R_per_game
##         <dbl>       <dbl>      <dbl>
## 1         0.9        3.56       4.24
## 2         0.7        3.97       4.47
## 3         0.8        3.37       4.69
## 4         1.1        3.46       4.42
## 5         1          2.75       4.61
## 6         0.9        3.06       4.58

Let’s write some code to see the output of a group_by call. Note that there are no columns with the information needed to define the groups. But if you look closely at the output, you notice the line "A tibble: 6 x 3."

# Output: Class
dat %>% group_by(HR_per_game) %>% class()
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

We can learn the class of the returned object using this line of code, and we see that the classes include "tbl," pronounced "tibble" (tbl_df is equivalent). A tibble is a special kind of data frame. We have seen them before, because tidyverse functions such as group_by and summarize always return this type of data frame. The group_by function returns a special kind of tibble: the grouped tibble. We will say more about grouped tibbles later. Note that the manipulation verbs select, filter, mutate, and arrange don’t necessarily return tibbles; they preserve the class of the input: if they receive a regular data frame, they return a regular data frame, and if they receive a tibble, they return a tibble. But tibbles are the default data frame of the tidyverse. Tibbles are very similar to data frames; you can think of them as modern versions of data frames. Next, we briefly describe three important differences.

2.3.2 Tibbles: Differences from Data Frames

In this video, we’re going to describe some of the main differences between tibbles and data frames.

1. Tibbles display better
First, the print method for tibbles is much more readable than that of a data frame. To see this, type Teams in your console after loading the Lahman baseball database. You will see a very, very long list of columns and rows that is barely readable. Teams is a data frame with many rows and columns, and the output simply wraps around and is hard to read. It is so bad that we don’t even print it here; we’ll let you print it on your own screen.

# Tibbles display better
as_tibble(Teams)
## # A tibble: 2,835 x 48
##    yearID lgID  teamID franchID divID  Rank     G Ghome     W     L DivWin
##     <int> <fct> <fct>  <fct>    <chr> <int> <int> <int> <int> <int> <chr> 
##  1   1871 NA    BS1    BNA      <NA>      3    31    NA    20    10 <NA>  
##  2   1871 NA    CH1    CNA      <NA>      2    28    NA    19     9 <NA>  
##  3   1871 NA    CL1    CFC      <NA>      8    29    NA    10    19 <NA>  
##  4   1871 NA    FW1    KEK      <NA>      7    19    NA     7    12 <NA>  
##  5   1871 NA    NY2    NNA      <NA>      5    33    NA    16    17 <NA>  
##  6   1871 NA    PH1    PNA      <NA>      1    28    NA    21     7 <NA>  
##  7   1871 NA    RC1    ROK      <NA>      9    25    NA     4    21 <NA>  
##  8   1871 NA    TRO    TRO      <NA>      6    29    NA    13    15 <NA>  
##  9   1871 NA    WS3    OLY      <NA>      4    32    NA    15    15 <NA>  
## 10   1872 NA    BL1    BLC      <NA>      2    58    NA    35    19 <NA>  
## # ... with 2,825 more rows, and 37 more variables: WCWin <chr>,
## #   LgWin <chr>, WSWin <chr>, R <int>, AB <int>, H <int>, X2B <int>,
## #   X3B <int>, HR <int>, BB <int>, SO <int>, SB <int>, CS <int>,
## #   HBP <int>, SF <int>, RA <int>, ER <int>, ERA <dbl>, CG <int>,
## #   SHO <int>, SV <int>, IPouts <int>, HA <int>, HRA <int>, BBA <int>,
## #   SOA <int>, E <int>, DP <int>, FP <dbl>, name <chr>, park <chr>,
## #   attendance <int>, BPF <int>, PPF <int>, teamIDBR <chr>,
## #   teamIDlahman45 <chr>, teamIDretro <chr>

Now if you convert this data frame to a tibble, the output, shown above, is much more readable.

2. Subsets of tibbles are tibbles
That’s the first main difference between tibbles and data frames. A second one is that if you subset the columns of a data frame, you may get back an object that is not a data frame.

class(Teams[,20])
## [1] "integer"

Here’s an example. If we subset the 20th column, we get back an integer. That’s not a data frame. With tibbles, this is not the case.

class(as_tibble(Teams[,20]))
## [1] "tbl_df"     "tbl"        "data.frame"

Here’s an example. If we subset a tibble, we get back a tibble. This is useful in the tidyverse, since functions require data frames as input. Whenever you want to access the original vector that defines a column of a tibble, you have to use the accessor `$`.

class(as_tibble(Teams)$HR)
## [1] "integer"

Here’s an example. A related feature is that tibbles will give you a warning if you try to access a column that does not exist. That’s not the case for data frames. For example, if we accidentally write hr in lowercase instead of uppercase, with a data frame all we get is a NULL, with no warning. This can make it quite hard to debug code. In contrast, if it’s a tibble and you try to access the lowercase hr column, which doesn’t exist, you get a warning.
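As a small self-contained illustration of this difference (using a toy data frame rather than Teams):

```r
library(tibble)

df <- data.frame(HR = c(10, 20, 30))
df$hr                # NULL, silently: easy to miss the typo

tb <- as_tibble(df)
tb$hr                # NULL, but with a warning: Unknown or uninitialised column: `hr`
```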

3. Tibbles can have complex entries
A third difference is that while the columns of a data frame need to be vectors of numbers, strings, or Boolean values, tibbles can have more complex objects, such as lists or functions. Also note that we can create tibbles with the tibble function.

# Tibbles can have complex entries
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
## # A tibble: 3 x 2
##      id func  
##   <dbl> <list>
## 1     1 <fn>  
## 2     2 <fn>  
## 3     3 <fn>

So, look at this line of code. We’re creating a column that actually contains functions. You can see the output above.

4. Tibbles can be grouped
Finally, the last difference we describe is that tibbles can be grouped. The function group_by returns a special kind of tibble: the grouped tibble. This class stores information that lets you know which rows are in which groups. The tidyverse functions, in particular the summarize function, are aware of the group information. In the example we showed, we saw that the lm function, which is not part of the tidyverse, does not know how to deal with grouped tibbles. The object is basically converted to a regular data frame, and then the function runs ignoring the groups. This is why we only get one pair of estimates, as we saw. To make non-tidyverse functions integrate better with the tidyverse, we will learn a new function: do.

2.3.3 do()

In this video, we’ll describe the very useful do() function. The tidyverse functions know how to interpret grouped tibbles. Furthermore, to facilitate stringing commands through the pipe, tidyverse functions consistently return data frames, since this assures that the output of one is accepted as the input of another. But most R functions do not recognize grouped tibbles, nor do they return data frames. The lm() function is an example. The do() function serves as a bridge between R functions, such as lm(), and the tidyverse. The do() function understands grouped tibbles and always returns a data frame. So let’s try to use the do() function to fit a regression line to each home run stratum.

data("Teams")

dat <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = round(HR / G, 1),
         BB_per_game = BB  / G,
         R_per_game  = R / G) %>%
  select(HR_per_game, BB_per_game, R_per_game) %>%
  filter(HR_per_game >= 0.4 & HR_per_game <= 1.2)

dat %>%
  group_by(HR_per_game) %>%
  do(fit = lm(R_per_game ~ BB_per_game, data = .))
## Source: local data frame [9 x 2]
## Groups: <by row>
## 
## # A tibble: 9 x 2
##   HR_per_game fit     
## *       <dbl> <list>  
## 1         0.4 <S3: lm>
## 2         0.5 <S3: lm>
## 3         0.6 <S3: lm>
## 4         0.7 <S3: lm>
## 5         0.8 <S3: lm>
## 6         0.9 <S3: lm>
## 7         1   <S3: lm>
## 8         1.1 <S3: lm>
## 9         1.2 <S3: lm>

We would do it like this. Notice that we did in fact fit a regression line to each stratum. The do() function created a data frame with the first column being the strata value and a column named fit. We chose that name; it could be any name. That column contains the result of the lm() call. Therefore, the returned table has a column with lm objects in the cells, which is not very useful. Also note that if we do not name a column, then do() will return the actual output of lm(), not a data frame, and this will result in an error, since do() is expecting a data frame as output.

# Error, because do() expects a data frame as output
dat %>%
  group_by(HR_per_game) %>%
  do(lm(R_per_game ~ BB_per_game, data = .))

If you write this, you will get an error. For a useful data frame to be constructed, the output of the function inside do() must be a data frame as well. We could build a function that returns only what we want in the form of a data frame.

We could write, for example, this function, and then use do() without naming the output, since we are then already returning a data frame.

# Output as data frame
get_slope <- function(data){
  fit <- lm(R_per_game ~ BB_per_game, data = data)
  data.frame(slope = fit$coefficient[2],
             se = summary(fit)$coefficient[2, 2])
}

dat %>%
  group_by(HR_per_game) %>%
  do(get_slope(.))
## # A tibble: 9 x 3
## # Groups:   HR_per_game [9]
##   HR_per_game slope     se
##         <dbl> <dbl>  <dbl>
## 1         0.4 0.734 0.208 
## 2         0.5 0.566 0.110 
## 3         0.6 0.412 0.0974
## 4         0.7 0.285 0.0705
## 5         0.8 0.365 0.0652
## 6         0.9 0.261 0.0754
## 7         1   0.511 0.0749
## 8         1.1 0.454 0.0855
## 9         1.2 0.440 0.0801

We can write this very simple piece of code, and now we get the expected result: the slope for each stratum and the standard error for that slope. If we name the output, then we get a column containing the data frames.

# Name the output
dat %>%
  group_by(HR_per_game) %>%
  do(slope = get_slope(.))
## Source: local data frame [9 x 2]
## Groups: <by row>
## 
## # A tibble: 9 x 2
##   HR_per_game slope               
## *       <dbl> <list>              
## 1         0.4 <data.frame [1 × 2]>
## 2         0.5 <data.frame [1 × 2]>
## 3         0.6 <data.frame [1 × 2]>
## 4         0.7 <data.frame [1 × 2]>
## 5         0.8 <data.frame [1 × 2]>
## 6         0.9 <data.frame [1 × 2]>
## 7         1   <data.frame [1 × 2]>
## 8         1.1 <data.frame [1 × 2]>
## 9         1.2 <data.frame [1 × 2]>

So if we write this piece of code, once again we get one of these complex tibbles, with a column having a data frame in each cell, which is again not very useful. All right. Now we’re going to cover one last feature of do(). If the data frame being returned has more than one row, the rows will be concatenated appropriately.

# If the data frame being returned has more than one row
get_lse <- function(data){
  fit <- lm(R_per_game ~ BB_per_game, data = data)
  data.frame(term = names(fit$coefficients),
             slope = fit$coefficient,
             se = summary(fit)$coefficient[, 2])
}

dat %>%
  group_by(HR_per_game) %>%
  do(get_lse(.))
## # A tibble: 18 x 4
## # Groups:   HR_per_game [9]
##    HR_per_game term        slope     se
##          <dbl> <fct>       <dbl>  <dbl>
##  1         0.4 (Intercept) 1.36  0.631 
##  2         0.4 BB_per_game 0.734 0.208 
##  3         0.5 (Intercept) 2.01  0.344 
##  4         0.5 BB_per_game 0.566 0.110 
##  5         0.6 (Intercept) 2.53  0.305 
##  6         0.6 BB_per_game 0.412 0.0974
##  7         0.7 (Intercept) 3.21  0.225 
##  8         0.7 BB_per_game 0.285 0.0705
##  9         0.8 (Intercept) 3.07  0.213 
## 10         0.8 BB_per_game 0.365 0.0652
## 11         0.9 (Intercept) 3.54  0.252 
## 12         0.9 BB_per_game 0.261 0.0754
## 13         1   (Intercept) 2.88  0.255 
## 14         1   BB_per_game 0.511 0.0749
## 15         1.1 (Intercept) 3.21  0.300 
## 16         1.1 BB_per_game 0.454 0.0855
## 17         1.2 (Intercept) 3.40  0.291 
## 18         1.2 BB_per_game 0.440 0.0801

Here’s an example in which we return both estimated parameters, the intercept and the slope. We write this piece of code, and now we use the do() function as we used it before and get a very useful tibble, giving us the estimates of the slope and intercept, as well as the standard errors. Now, if you think this is all a bit too complicated, you’re not alone. To simplify things, we’re going to introduce the broom package, which was designed to facilitate the use of model fitting functions, such as lm(), with the tidyverse.

2.3.4 Broom

The original task we asked for in a previous video was to provide an estimate and a confidence interval for the slope estimates of each stratum. The broom package will make this quite easy. broom has three main functions, all of which extract information from the object returned by lm and return it in a tidyverse-friendly data frame. These functions are tidy, glance, and augment. The tidy function returns estimates and related information as a data frame.

data("Teams")

dat <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = round(HR / G, 1),
         BB_per_game = BB  / G,
         R_per_game  = R / G) %>%
  select(HR_per_game, BB_per_game, R_per_game) %>%
  filter(HR_per_game >= 0.4 & HR_per_game <= 1.2)

fit <- lm(R_per_game ~ BB_per_game, data = dat)
tidy(fit)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    2.20     0.113       19.4 1.06e-70
## 2 BB_per_game    0.638    0.0344      18.5 1.37e-65

Here’s an example.

# Additional important summaries
tidy(fit, conf.int = TRUE)
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    2.20     0.113       19.4 1.06e-70    1.98      2.42 
## 2 BB_per_game    0.638    0.0344      18.5 1.37e-65    0.570     0.705

We can add other important summaries, such as confidence intervals, using arguments like this. Because the outcome is a data frame, we can immediately use it with do() to string together the commands that produce the table we are after. So this piece of code will generate what we wanted to see.

# Outcome is a data frame, so we can use do()
dat %>%
  group_by(HR_per_game) %>%
  do(tidy(lm(R_per_game ~ BB_per_game, data = .), conf.int = TRUE))
## # A tibble: 18 x 8
## # Groups:   HR_per_game [9]
##    HR_per_game term  estimate std.error statistic  p.value conf.low
##          <dbl> <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
##  1         0.4 (Int…    1.36     0.631       2.16 4.05e- 2   0.0631
##  2         0.4 BB_p…    0.734    0.208       3.54 1.54e- 3   0.308 
##  3         0.5 (Int…    2.01     0.344       5.84 2.07e- 7   1.32  
##  4         0.5 BB_p…    0.566    0.110       5.14 3.02e- 6   0.346 
##  5         0.6 (Int…    2.53     0.305       8.32 2.43e-13   1.93  
##  6         0.6 BB_p…    0.412    0.0974      4.23 4.80e- 5   0.219 
##  7         0.7 (Int…    3.21     0.225      14.3  1.49e-30   2.76  
##  8         0.7 BB_p…    0.285    0.0705      4.05 7.93e- 5   0.146 
##  9         0.8 (Int…    3.07     0.213      14.4  5.06e-31   2.65  
## 10         0.8 BB_p…    0.365    0.0652      5.59 8.94e- 8   0.236 
## 11         0.9 (Int…    3.54     0.252      14.1  1.37e-28   3.04  
## 12         0.9 BB_p…    0.261    0.0754      3.46 7.12e- 4   0.112 
## 13         1   (Int…    2.88     0.255      11.3  5.25e-21   2.38  
## 14         1   BB_p…    0.511    0.0749      6.82 3.06e-10   0.363 
## 15         1.1 (Int…    3.21     0.300      10.7  6.46e-17   2.61  
## 16         1.1 BB_p…    0.454    0.0855      5.31 1.03e- 6   0.284 
## 17         1.2 (Int…    3.40     0.291      11.7  2.33e-16   2.81  
## 18         1.2 BB_p…    0.440    0.0801      5.50 1.07e- 6   0.280 
## # ... with 1 more variable: conf.high <dbl>

Because a data frame is returned, we can filter and select the rows and columns we want.

# select() and filter()
dat %>%
  group_by(HR_per_game) %>%
  do(tidy(lm(R_per_game ~ BB_per_game, data = .), conf.int = TRUE)) %>%
  filter(term == "BB_per_game") %>%
  select(HR_per_game, estimate, conf.low, conf.high)
## # A tibble: 9 x 4
## # Groups:   HR_per_game [9]
##   HR_per_game estimate conf.low conf.high
##         <dbl>    <dbl>    <dbl>     <dbl>
## 1         0.4    0.734    0.308     1.16 
## 2         0.5    0.566    0.346     0.786
## 3         0.6    0.412    0.219     0.605
## 4         0.7    0.285    0.146     0.425
## 5         0.8    0.365    0.236     0.494
## 6         0.9    0.261    0.112     0.410
## 7         1      0.511    0.363     0.659
## 8         1.1    0.454    0.284     0.624
## 9         1.2    0.440    0.280     0.601

So this simple piece of code gives us exactly the table we asked for. We have filtered away the intercept rows, and only show the columns we care about, the estimate and the confidence intervals. Furthermore, a table like this makes visualization with ggplot quite easy.

## Plot
dat %>%
  group_by(HR_per_game) %>%
  do(tidy(lm(R_per_game ~ BB_per_game, data = .), conf.int = TRUE)) %>%
  filter(term == "BB_per_game") %>%
  select(HR_per_game, estimate, conf.low, conf.high) %>%
  ggplot(aes(HR_per_game, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_errorbar() +
  geom_point()

So this piece of code produces this nice plot, which provides very useful information. Now we return to discussing our original task of determining if the slopes change. The plot we just made using do and broom shows that the confidence intervals overlap, which provides a nice visual confirmation that our assumption that the slopes do not change with home run strata is relatively safe.

# glance() and augment()
glance(fit)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.266         0.265 0.454      343. 1.37e-65     2  -596. 1199. 1213.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
augment(fit)
## # A tibble: 949 x 9
##    R_per_game BB_per_game .fitted .se.fit  .resid    .hat .sigma .cooksd
##  *      <dbl>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1       4.24        3.56    4.47  0.0179 -0.233  0.00156  0.454 2.06e-4
##  2       4.47        3.97    4.73  0.0283 -0.258  0.00389  0.454 6.32e-4
##  3       4.69        3.37    4.35  0.0152  0.343  0.00112  0.454 3.19e-4
##  4       4.42        3.46    4.40  0.0161  0.0144 0.00126  0.454 6.33e-7
##  5       4.61        2.75    3.95  0.0232  0.660  0.00261  0.454 2.77e-3
##  6       4.58        3.06    4.15  0.0164  0.430  0.00131  0.454 5.89e-4
##  7       5.16        4.13    4.83  0.0331  0.328  0.00532  0.454 1.40e-3
##  8       4.22        3.58    4.48  0.0183 -0.266  0.00162  0.454 2.78e-4
##  9       4.59        4.20    4.88  0.0355 -0.287  0.00610  0.454 1.23e-3
## 10       4.77        3.87    4.67  0.0255  0.106  0.00314  0.454 8.58e-5
## # ... with 939 more rows, and 1 more variable: .std.resid <dbl>

Earlier we mentioned two other functions from the broom package: glance and augment. glance and augment relate to model-specific and observation-specific outcomes, respectively. Here we can see the model fit summary that glance returns. You can learn more about these summaries in any regression textbook. We’ll see an example of augment in a future video.

2.4 Regression and Baseball

2.4.1 Building a Better Offensive Metric for Baseball

In trying to answer how well bases on balls predict runs, data exploration led us to this model.

\[E[R~|~BB=x_1,HR=x_2]~=~\beta_0~+~\beta_1x_1~+~\beta_2x_2\]

Here, the data is approximately normal, and the conditional distributions were also normal. Thus, we’re justified in posing a linear model like this.

\[Y_i~=~\beta_0~+~\beta_1x_1~+~\beta_2x_2~+~\epsilon_i\\Y_i~=~runs~per~game\\x_1~=~BB~per~game\\x_2~=~HR~per~game\]

To use lm here, we need to let it know that we have two predictive variables. So we use the plus symbol as follows.

data("Teams")

fit <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB = BB / G, HR = HR / G, R = R / G) %>%
  lm(R ~ BB + HR, data = .)

tidy(fit, conf.int = TRUE)
## # A tibble: 3 x 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.74     0.0823      21.2 7.30e- 83    1.58      1.91 
## 2 BB             0.387    0.0270      14.3 1.20e- 42    0.334     0.440
## 3 HR             1.56     0.0490      31.9 1.75e-155    1.47      1.66

Here’s the code that fits that multiple regression model. Now we can use the tidy function to see a nice summary. When we fit the model with only one variable, without the adjustment, the estimated slopes were 0.735 and 1.844 for bases on balls and home runs, respectively. But note that when we fit the multivariable model, both these slopes go down, with the bases on balls effect decreasing much more. Now, if we want to construct a metric to pick players, we need to consider singles, doubles, and triples as well. Can we build a model that predicts runs based on all these outcomes? We’re going to take somewhat of a leap of faith and assume that these five variables are jointly normal. This means that if we pick any one of them and hold the other four fixed, the relationship with the outcome, in this case runs per game, is linear, and the slopes for this relationship do not depend on the other four values that were held constant. If this model holds true, then a linear model for our data is the following.

\[Y_i~=~\beta_0~+~\beta_1x_{i,1}~+~\beta_2x_{i,2}~+~\beta_3x_{i,3}~+~\beta_4x_{i,4}~+~\beta_5x_{i,5}~+~\epsilon_i\]

With x1, x2, x3, x4, x5 representing bases on balls per game, singles per game, doubles per game, triples per game, and home runs per game, respectively.

fit_1 <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB = BB / G,
         singles = (H - X2B - X3B - HR) / G,
         doubles = X2B / G,
         triples = X3B / G,
         HR = HR / G, 
         R = R / G) %>%
  lm(R ~ BB + singles + doubles + triples + HR, data = .)
coefs <- tidy(fit_1, conf.int = TRUE)
coefs
## # A tibble: 6 x 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   -2.77     0.0862     -32.1 5.32e-157   -2.94     -2.60 
## 2 BB             0.371    0.0117      31.6 2.08e-153    0.348     0.394
## 3 singles        0.519    0.0127      40.8 9.81e-217    0.494     0.544
## 4 doubles        0.771    0.0226      34.1 8.93e-171    0.727     0.816
## 5 triples        1.24     0.0768      16.1 2.24e- 52    1.09      1.39 
## 6 HR             1.44     0.0244      59.3 0.           1.40      1.49

Using lm, we can quickly find the least squares estimates for the parameters using this relatively simple piece of code. We can again use the tidy function to see the coefficients, the standard errors, and the confidence intervals.

Teams %>% 
  filter(yearID %in% 2002) %>% 
  mutate(BB = BB/G, 
         singles = (H-X2B-X3B-HR)/G, 
         doubles = X2B/G, 
         triples =X3B/G, 
         HR=HR/G,
         R=R/G)  %>% 
  mutate(R_hat = predict(fit_1, newdata = .)) %>%
  ggplot(aes(R_hat, R, label = teamID)) + 
  geom_point() +
  geom_text(nudge_x=0.1, cex = 2) + 
  geom_abline()

To see how well our metric actually predicts runs, we can predict the number of runs for each team in 2002 using the function predict to make the plot. Note that we did not use the 2002 year to create this metric. We used data from years previous to 2002.

And here is the plot. Our model does quite a good job, as demonstrated by the fact that points from the observed versus predicted plot fall close to the identity line. So instead of using batting average or just the number of home runs as a measure for picking players, we can use our fitted model to form a more informative metric that relates more directly to run production. Specifically, to define a metric for player A, we imagine a team made up of players just like player A and use our fitted a regression model to predict how many runs this team would produce.

The formula would look like this: we’re basically plugging the estimated coefficients into the regression formula. However, to define a player-specific metric, we have a bit more work to do. Our challenge here is that we derived the metric for teams, based on team-level summary statistics. For example, the home run value that is entered into the equation is home runs per game for the entire team. If you compute the home runs per game for a player, it will be much lower, as the total is accumulated by nine batters, not just one. Furthermore, if a player only plays part of the game and gets fewer opportunities than average, it’s still counted as a game played, so their rates will be lower than they should be. For players, a rate that takes into account opportunities is the per-plate-appearance rate.
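Plugging the estimated coefficients from the coefs table above into the regression equation, the team-level prediction is approximately:

\[\hat{Y}~=~-2.77~+~0.371\times BB~+~0.519\times singles~+~0.771\times doubles~+~1.24\times triples~+~1.44\times HR\]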

pa_per_game <- Batting %>%
  filter(yearID == 2002) %>%
  group_by(teamID) %>%
  summarize(pa_per_game = sum(AB + BB) / max(G)) %>%
  .$pa_per_game %>%
  mean

To make the per-game team rates comparable to the per-plate-appearance player rates, we compute the average number of team plate appearances per game using this simple piece of code. Now we’re ready to use our metric. We’re going to compute the per-plate-appearance rates for players available in 2002, but we’re going to use data from 1999-2001, because, remember, we are picking players in 2002 and don’t know what has happened yet. To avoid small sample artifacts, we’re going to filter out players with few plate appearances.

players <- Batting %>% filter(yearID %in% 1999:2001) %>% 
  group_by(playerID) %>%
  mutate(PA = BB + AB) %>%
  summarize(G = sum(PA)/pa_per_game,
            BB = sum(BB)/G,
            singles = sum(H-X2B-X3B-HR)/G,
            doubles = sum(X2B)/G, 
            triples = sum(X3B)/G, 
            HR = sum(HR)/G,
            AVG = sum(H)/sum(AB),
            PA = sum(PA)) %>%
  filter(PA >= 300) %>%
  select(-G) %>%
  mutate(R_hat = predict(fit_1, newdata = .))
players %>% ggplot(aes(R_hat)) + 
  geom_histogram(binwidth = 0.5, color = "black")

Here is the calculation of what we want to do, in one long piece of code using the tidyverse. So we fit our model, and we have player-specific metrics. The player-specific predicted runs computed here can be interpreted as the number of runs we would predict a team to score if this team was made up of just that player, if that player batted every single time. The distribution shows that there is wide variability across players, as we can see here. To actually build the team, we will need to know the players’ salaries, since we have a limited budget. Remember, we are pretending to be the Oakland A’s in 2002 with only a $40 million budget. We also need to know the players’ positions, because we’re going to need one shortstop, one second baseman, one third baseman, et cetera. For this, we’re going to have to do a little bit of data wrangling to combine information that is contained in different tables from the Lahman library. OK, so here we go.

players <- Salaries %>%
  filter(yearID == 2002) %>%
  select(playerID, salary) %>%
  right_join(players, by = 'playerID')

We start by adding the 2002 salaries for each player using this code. Next, we’re going to add the defensive position. This is a little bit complicated, because players play more than one position each year. So here we’re going to pick the position most played by each player using the top_n function, and to make sure that we only pick one position in the case of ties, we take the first row. We also remove the OF position, because this stands for outfielder, which is a generalization of three positions: left field, center field, and right field. We also remove pitchers, as they don’t bat in the league the Athletics play in.

players <- Fielding %>%
  filter(!POS %in% c("OF", "P"))%>%
  group_by(playerID) %>%
  top_n(1, G) %>%
  filter(row_number(G) == 1) %>%
  ungroup() %>%
  select(playerID, POS) %>%
  right_join(players, by = "playerID")
players %>% filter(!is.na(POS) & !is.na(salary))
## # A tibble: 277 x 11
##    playerID POS   salary    BB singles doubles triples    HR   AVG    PA
##    <chr>    <chr>  <int> <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl> <int>
##  1 abernbr… 2B    2.15e5  3.16    6.91    1.99  0.117  0.585 0.270   331
##  2 alfoned… 2B    6.20e6  4.81    6.31    2.15  0.0625 1.44  0.293  1860
##  3 alicelu… 2B    8.00e5  3.55    7.16    1.65  0.387  0.419 0.273  1201
##  4 alomaro… 2B    7.94e6  4.73    7.20    2.22  0.331  1.23  0.323  1991
##  5 alomasa… C     2.50e6  1.66    7.39    1.92  0.156  0.884 0.279   745
##  6 anderma… 2B    4.00e5  2.28    6.93    2.05  0.225  0.546 0.268  1207
##  7 aurilri… SS    5.25e6  3.02    6.92    1.76  0.168  1.66  0.294  1847
##  8 ausmubr… C     5.00e6  3.74    6.39    1.82  0.324  0.524 0.259  1553
##  9 bagweje… 1B    1.10e7  6.63    5.11    2.11  0.0916 2.35  0.301  2114
## 10 bakopa01 C     7.50e5  4.38    5.46    2.04  0.180  0.360 0.234   646
## # ... with 267 more rows, and 1 more variable: R_hat <dbl>

Here is the code that does that.

players <- Master %>%
  select(playerID, nameFirst, nameLast, debut) %>%
  right_join(players, by = "playerID")

Finally, we add their first and last names so we know who we’re talking about. Here’s the code that does that.

players %>% 
  select(nameFirst, nameLast, POS, salary, R_hat) %>%
  arrange(desc(R_hat)) %>%
  top_n(10)
## Selecting by R_hat
##    nameFirst  nameLast  POS   salary     R_hat
## 1      Barry     Bonds <NA> 15000000 10.650805
## 2       Mark   McGwire   1B       NA 10.525057
## 3      Sammy      Sosa <NA> 15000000  8.910518
## 4      Manny   Ramirez <NA> 15462727  8.240557
## 5        Jim     Thome   1B  8000000  8.232642
## 6      Jason    Giambi   1B 10428571  8.052339
## 7       Jeff   Bagwell   1B 11000000  7.977242
## 8     Rafael  Palmeiro   1B  8712986  7.930042
## 9       Alex Rodriguez   SS 22000000  7.806563
## 10      Gary Sheffield   3B  9916667  7.804703

So now we have a table with our predicted run statistic, some other statistics, the players’ names, their positions, and their salaries. If we look at the top 10 players based on our run production statistic, you’re going to recognize some names if you’re a baseball fan. Note the very high salaries of these players in the top 10. In fact, we see that players with high metrics have high salaries.

players %>%
  ggplot(aes(salary, R_hat, color = POS)) +
  geom_point() +
  scale_x_log10()
## Warning: Removed 117 rows containing missing values (geom_point).

We can see that by making a plot. We do see some low-cost players with very high metrics. These would be great for our team. Unfortunately, these are likely young players that have not yet been able to negotiate a salary and are not going to be available in 2002. For example, the lowest earner on our top 10 list is Albert Pujols, who was a rookie in 2001.

players %>%
  filter(debut < 1998) %>%
  ggplot(aes(salary, R_hat, color = POS)) +
  geom_point() +
  scale_x_log10()
## Warning: Removed 102 rows containing missing values (geom_point).

Here’s a plot with players that debuted before 1998. This removes the young players. We can now search for good deals by looking at players that produce many more runs than others with similar salaries. We can use this table to decide which players to pick and keep our total salary below the $40 million Billy Beane had to work with.

2.4.2 On Base Plus Slugging (OPS)

Since the 1980s, sabermetricians have used a summary statistic different from batting average to evaluate players. They realized walks were important, and that doubles, triples, and home runs should be weighted much more than singles, and proposed the following metric.
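The metric, written out, combines the walk rate with the slugging percentage (total bases per at bat):

\[OPS~=~\frac{BB}{PA}~+~\frac{Singles~+~2\times Doubles~+~3\times Triples~+~4\times HR}{AB}\]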

They call this on-base percentage plus slugging percentage, or OPS. Today, this statistic has caught on, and you see it on ESPN and other sports networks. Although the sabermetricians were probably not using regression, this metric is impressively close to the regression-based summary statistic that we created.

Here is the plot. They’re very correlated.
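The plot itself is not reproduced here. As a sketch of the correlation claim (assuming the Teams table from the Lahman package used throughout; note that Singles + 2 Doubles + 3 Triples + 4 HR equals H + X2B + 2 X3B + 3 HR, since H counts all hits), we can compute team-level OPS and compare it to runs per game:

```r
library(tidyverse)
library(Lahman)

# Team-level OPS: walk rate plus slugging (total bases per at bat)
team_ops <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(OPS = BB / (AB + BB) + (H + X2B + 2 * X3B + 3 * HR) / AB,
         R_per_game = R / G)

cor(team_ops$OPS, team_ops$R_per_game)  # strongly positive

team_ops %>%
  ggplot(aes(OPS, R_per_game)) +
  geom_point(alpha = 0.5)
```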

2.4.3 Regression Fallacy

Wikipedia defines the sophomore slump in the following way. A sophomore slump, or sophomore jinx, or sophomore jitters refers to an instance in which a second, or sophomore, effort fails to live up to the standard of the first effort. It is commonly used to refer to the apathy of students (second year of high school, college, or university), the performance of athletes (second season of play), singers/bands (second album), television shows (second season), and movies (sequels or prequels). We hear about the sophomore slump often in Major League Baseball, because in Major League Baseball the Rookie of the Year, an award given to the first-year player judged to have performed the best, usually does not perform as well during their second year. Therefore they call this the sophomore slump. Note, for example, that in a recent Fox Sports article they asked, “Will MLB’s tremendous rookie class of 2015 suffer a sophomore slump?” Now, does the data confirm the existence of a sophomore slump? Let’s take a look and examine the batting average data to see if the observation holds true. The data is available in the Lahman library, but we have to do some work to create a table with the statistics for all the Rookies of the Year.

playerInfo <- Fielding %>%
  group_by(playerID) %>%
  arrange(desc(G)) %>%
  slice(1) %>%
  ungroup %>%
  left_join(Master, by = "playerID") %>%
  select(playerID, nameFirst, nameLast, POS)
head(playerInfo)
## # A tibble: 6 x 4
##   playerID  nameFirst nameLast POS  
##   <chr>     <chr>     <chr>    <chr>
## 1 aardsda01 David     Aardsma  P    
## 2 aaronha01 Hank      Aaron    OF   
## 3 aaronto01 Tommie    Aaron    1B   
## 4 aasedo01  Don       Aase     P    
## 5 abadan01  Andy      Abad     1B   
## 6 abadfe01  Fernando  Abad     P

Let’s go through the steps. First, we create a table with each player’s ID, name, and most-played position, using the code above. Next, we create a table with only the Rookie of the Year Award winners and add their batting statistics. We’re going to filter out pitchers, since pitchers are not given awards for batting, and we’re going to focus on offense. Specifically, we’ll focus on batting average, since it is the summary that most pundits talk about when discussing the sophomore slump.

ROY <- AwardsPlayers %>%
  filter(awardID == "Rookie of the Year") %>%
  left_join(playerInfo, by = "playerID") %>%
  rename(rookie_year = yearID) %>%
  right_join(Batting, by = "playerID") %>%
  mutate(AVG = H / AB) %>%
  filter(POS != "P")
head(ROY)
##    playerID            awardID rookie_year lgID.x  tie notes nameFirst
## 1  darkal01 Rookie of the Year        1948     ML <NA>  <NA>        Al
## 2 robinja02 Rookie of the Year        1947     ML <NA>  <NA>    Jackie
## 3  darkal01 Rookie of the Year        1948     ML <NA>  <NA>        Al
## 4 robinja02 Rookie of the Year        1947     ML <NA>  <NA>    Jackie
## 5  darkal01 Rookie of the Year        1948     ML <NA>  <NA>        Al
## 6 dropowa01 Rookie of the Year        1950     AL <NA>  <NA>      Walt
##   nameLast POS yearID stint teamID lgID.y   G  AB   R   H X2B X3B HR RBI
## 1     Dark  SS   1946     1    BSN     NL  15  13   0   3   3   0  0   1
## 2 Robinson  2B   1947     1    BRO     NL 151 590 125 175  31   5 12  48
## 3     Dark  SS   1948     1    BSN     NL 137 543  85 175  39   6  3  48
## 4 Robinson  2B   1948     1    BRO     NL 147 574 108 170  38   8 12  85
## 5     Dark  SS   1949     1    BSN     NL 130 529  74 146  23   5  3  53
## 6    Dropo  1B   1949     1    BOS     AL  11  41   3   6   2   0  0   1
##   SB CS BB SO IBB HBP SH SF GIDP       AVG
## 1  0 NA  0  3  NA   0  0 NA    0 0.2307692
## 2 29 NA 74 36  NA   9 28 NA    5 0.2966102
## 3  4 NA 24 36  NA   2 10 NA    4 0.3222836
## 4 22 NA 57 37  NA   7  8 NA    7 0.2961672
## 5  5 NA 31 43  NA   1 11 NA   10 0.2759924
## 6  0  0  3  7  NA   0  0 NA    2 0.1463415

So we write this piece of code to do this.

ROY <- ROY %>%
  filter(yearID == rookie_year | yearID == rookie_year + 1) %>%
  group_by(playerID) %>%
  mutate(rookie = ifelse(yearID == min(yearID), "rookie", "sophomore")) %>%
  filter(n() == 2) %>%
  ungroup() %>%
  select(playerID, rookie_year, rookie, nameFirst, nameLast, AVG)
head(ROY)
## # A tibble: 6 x 6
##   playerID  rookie_year rookie    nameFirst nameLast   AVG
##   <chr>           <int> <chr>     <chr>     <chr>    <dbl>
## 1 robinja02        1947 rookie    Jackie    Robinson 0.297
## 2 darkal01         1948 rookie    Al        Dark     0.322
## 3 robinja02        1947 sophomore Jackie    Robinson 0.296
## 4 darkal01         1948 sophomore Al        Dark     0.276
## 5 sievero01        1949 rookie    Roy       Sievers  0.306
## 6 dropowa01        1950 rookie    Walt      Dropo    0.322

Now we’ll keep only the rookie and sophomore seasons and remove players that did not play a sophomore season. And remember, now we’re only looking at players that won the Rookie of the Year Award. This code achieves what we want.

ROY <- ROY %>%
  spread(rookie, AVG) %>%
  arrange(desc(rookie))
head(ROY)
## # A tibble: 6 x 6
##   playerID  rookie_year nameFirst nameLast rookie sophomore
##   <chr>           <int> <chr>     <chr>     <dbl>     <dbl>
## 1 mccovwi01        1959 Willie    McCovey   0.354     0.238
## 2 suzukic01        2001 Ichiro    Suzuki    0.350     0.321
## 3 bumbral01        1973 Al        Bumbry    0.337     0.233
## 4 lynnfr01         1975 Fred      Lynn      0.331     0.314
## 5 pujolal01        2001 Albert    Pujols    0.329     0.314
## 6 troutmi01        2012 Mike      Trout     0.326     0.323

Finally, we use the spread function to have one column for the rookie batting averages and another column for the sophomore batting averages. For that we use this simple line of code. Now we can see the top performers in their first year. These are the Rookie of the Year Award winners, and we’re showing their rookie-season and sophomore-season batting averages. Look closely and you will see the sophomore slump: it definitely appears to be real. In fact, the proportion of players that have a lower batting average in their sophomore year is 68%.

So is it jitters? Is it a jinx? To answer this question, let’s turn our attention to all players. We’re going to look at the 2013 season and 2014 season. And we’re going to look at players that batted at least 130 times. This is a minimum needed to win the Rookie of the Year. We’re going to perform a similar operation as we did before to construct this data set.

two_years <- Batting %>% 
  filter(yearID %in% 2013:2014) %>% 
  group_by(playerID, yearID) %>%  
  filter(sum(AB) >= 130) %>% 
  summarize(AVG = sum(H)/sum(AB)) %>% 
  ungroup %>% 
  spread(yearID, AVG) %>% 
  filter(!is.na(`2013`) & !is.na(`2014`)) %>%
  left_join(playerInfo, by = "playerID") %>%
  filter(POS!="P") %>% 
  select(-POS) %>%
  arrange(desc(`2013`)) %>% 
  select(-playerID)
head(two_years)
## # A tibble: 6 x 4
##   `2013` `2014` nameFirst nameLast
##    <dbl>  <dbl> <chr>     <chr>   
## 1  0.348  0.313 Miguel    Cabrera 
## 2  0.345  0.283 Hanley    Ramirez 
## 3  0.331  0.332 Michael   Cuddyer 
## 4  0.324  0.289 Scooter   Gennett 
## 5  0.324  0.277 Joe       Mauer   
## 6  0.323  0.287 Mike      Trout

Here is the code. Now let’s look at the top performers of 2013 and then look at their performance in 2014. Note that the same pattern arises when we look at the top performers: batting averages go down. But these are not rookies, so this can’t be explained with a sophomore slump. Also note what happens to the worst performers of 2013.

arrange(two_years, `2013`)
## # A tibble: 312 x 4
##    `2013` `2014` nameFirst nameLast 
##     <dbl>  <dbl> <chr>     <chr>    
##  1  0.158  0.219 Danny     Espinosa 
##  2  0.179  0.149 Dan       Uggla    
##  3  0.181  0.2   Jeff      Mathis   
##  4  0.184  0.208 Melvin    Upton    
##  5  0.190  0.262 Adam      Rosales  
##  6  0.192  0.215 Aaron     Hicks    
##  7  0.194  0.229 Chris     Colabello
##  8  0.194  0.177 J. P.     Arencibia
##  9  0.195  0.241 Tyler     Flowers  
## 10  0.198  0.218 Ryan      Hanigan  
## # ... with 302 more rows

Here they are. Their batting averages go up in their second season, 2014. Is this some sort of reverse sophomore slump? It is not. There is no such thing as a sophomore slump. This is all explained with a simple statistical fact: the correlation of performance in two separate years is high, but not perfect.

two_years %>%
  ggplot(aes(`2013`, `2014`)) +
  geom_point()

Here is the data for 2013 performance and 2014 performance. You can see it’s correlated. But it’s not perfectly correlated.

summarize(two_years, cor(`2013`, `2014`))
## # A tibble: 1 x 1
##   `cor(\`2013\`, \`2014\`)`
##                       <dbl>
## 1                     0.460

The correlation is 0.46. The data look very much like a bivariate normal distribution, which means that if we were to predict the 2014 batting average, let’s call it y, for any given player that had a 2013 batting average of x, we would use the regression equation, which would be this.

\[ \frac{Y - .255}{.032} = 0.46 \left( \frac{X - .261}{.023}\right) \]

Because the correlation is not perfect, regression tells us that, on average, we expect high performers from 2013 to do a little bit worse in 2014. This is regression to the mean. It’s not a jinx; it’s just due to chance. The Rookies of the Year are selected from the top values of X, so it is expected that their Y will regress to the mean.
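To make this concrete, consider a player who hit .350 in 2013, near the top of the table above. Plugging into the regression equation gives a predicted 2014 average of

\[ Y = .255 + .032 \times 0.46 \times \left( \frac{.350 - .261}{.023} \right) \approx .312, \]

well below .350 but still well above the overall mean of .255: the extreme value is pulled part of the way back toward the mean.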

2.4.4 Measurement Error Models

Up until now, all our linear regression examples have been applied to two or more random variables: we assume the pairs are bivariate normal and use this to motivate a linear model. This approach covers most real-life examples where linear regression is used. The other major application comes from measurement error models. In these applications, it is common to have nonrandom covariates, such as time, and randomness is introduced from measurement error rather than from sampling or natural variability. To understand these models, we’re going to use a motivating example related to physics. Imagine you are Galileo in the 16th century trying to describe the velocity of a falling object. An assistant climbs the Tower of Pisa and drops a ball, while several other assistants record the position at different times. The falling_object data set contains an example of what that data would look like.

falling_object <- rfalling_object()
falling_object
##    time distance observed_distance
## 1  0.00 55.86000         57.538200
## 2  0.25 55.55375         55.347234
## 3  0.50 54.63500         54.487458
## 4  0.75 53.10375         54.126534
## 5  1.00 50.96000         50.447184
## 6  1.25 48.20375         50.280506
## 7  1.50 44.83500         43.875381
## 8  1.75 40.85375         40.726193
## 9  2.00 36.26000         35.438581
## 10 2.25 31.05375         32.023337
## 11 2.50 25.23500         24.551304
## 12 2.75 18.80375         17.773798
## 13 3.00 11.76000         11.352742
## 14 3.25  4.10375          4.903877
falling_object %>% 
  ggplot(aes(time, observed_distance)) + 
  geom_point() +
  ylab("Distance in meters") + 
  xlab("Time in seconds")

The assistants hand the data to Galileo, and this is what he sees when he uses ggplot to make a plot: the distance of the ball from the ground, in meters, on the y-axis, and time in seconds on the x-axis. Galileo does not know the exact equation, but from data exploration, by looking at the plot, he deduces that the position should follow a parabola, which we can write like this.

\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2\]

The data does not fall exactly on a parabola, but Galileo knows that this is due to measurement error. His helpers make mistakes when measuring the distance the ball has fallen. To account for this, we write this model.

\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]

Here, Y_i represents the observed distance of the ball from the ground, in meters, x_i represents time in seconds, and epsilon_i represents measurement error. The measurement errors are assumed to be random, independent of each other, and to have the same distribution for each i. We also assume that there is no bias, which means the expected value of epsilon is 0.

\[ E[\varepsilon_i]=0\]

Note that this is a linear model because it is a linear combination of known quantities (x and x squared) and unknown parameters (the betas). Unlike in our previous examples, here the x’s are fixed quantities; this is just time, and we are not conditioning. Now, to pose a new physical theory and start making predictions about other falling objects, Galileo needs actual numbers rather than unknown parameters. The least squares estimates (LSE) seem like a reasonable approach. So how do we find the LSE? Note that the LSE calculations do not require the errors to be approximately normal. The lm( ) function will find the betas that minimize the residual sum of squares, which is what we want.

fit <- falling_object %>% 
  mutate(time_sq = time^2) %>% 
  lm(observed_distance~time+time_sq, data=.)
tidy(fit)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    56.9      0.707     80.5  1.36e-16
## 2 time           -1.10     1.01      -1.09 2.98e- 1
## 3 time_sq        -4.65     0.300    -15.5  7.90e- 9

So we use this code to obtain our estimated parameters. To check whether the estimated parabola fits the data, we can use the broom function augment( ), which lets us do this easily.

augment(fit) %>% 
  ggplot() +
  geom_point(aes(time, observed_distance)) + 
  geom_line(aes(time, .fitted), col = "blue")

Using this code, we can make the following plot. Note that the predicted values go right through the points. Now, thanks to my high school physics teacher, I know that the equation for the trajectory of a falling object is the following.

\[d = h_0 + v_0 t - 0.5 \times 9.8 t^2\]

Here, h0 and v0 are the starting height and starting velocity, respectively. The data we used were simulated from this equation with measurement error added to mimic real observations. Because we simply drop the ball from the Tower of Pisa, the starting velocity is 0, and the starting height is that of the tower, about 55.86 meters (the distance at time 0 in the data above). These known quantities are consistent with the parameters we estimated, which we can see using the tidy function.
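Matching the physics equation with the quadratic model term by term shows what the estimates should recover:

\[ \beta_0 = h_0 \approx 55.86, \qquad \beta_1 = v_0 = 0, \qquad \beta_2 = -\frac{9.8}{2} = -4.9, \]

where the value of h_0 is taken from the distance at time 0 in the simulated data above.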

tidy(fit, conf.int = TRUE)
## # A tibble: 3 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    56.9      0.707     80.5  1.36e-16    55.3      58.4 
## 2 time           -1.10     1.01      -1.09 2.98e- 1    -3.33      1.12
## 3 time_sq        -4.65     0.300    -15.5  7.90e- 9    -5.31     -3.99

Here they are. The Tower of Pisa height is within the confidence interval for beta 0, and the initial velocity of 0 is within the confidence interval for beta 1. Note that the p-value for beta 1 is larger than 0.05, which means we wouldn’t reject the hypothesis that the starting velocity is 0. And finally, the acceleration constant, 9.8 meters per second squared, is within the confidence interval for negative 2 times beta 2.
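That last check can be made explicit by multiplying the time_sq estimate and its confidence interval by negative 2:

\[ -2\hat{\beta}_2 = -2 \times (-4.65) = 9.3, \qquad 95\% \mbox{ CI: } (7.98,\; 10.62), \]

an interval that contains the acceleration constant 9.8.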

3. Confounding

In the Confounding section, you will learn what is perhaps the most important lesson of statistics: that correlation is not causation.

After completing this section, you will be able to:

This section has one part: Correlation is Not Causation. There are comprehension checks that follow most videos.

3.1 Correlation is Not Causation

3.1.1 Correlation is Not Causation: Spurious Correlation

Association is not causation is perhaps the most important lesson one learns in a statistics class. Correlation is not causation is another way to say this.

In this course, we have described tools useful for quantifying associations between variables, but we must be careful not to overinterpret these associations. There are many reasons that a variable X can correlate with a variable Y without either being a cause of the other. Here we examine common ways that associations can be misinterpreted. The first way is spurious correlation. The following comical example underscores that correlation is not causation.

The example shows a very strong correlation between divorce rates and margarine consumption. The correlation is 0.93. Does this mean that margarine causes divorces, or do divorces cause people to eat more margarine? Of course, the answer to both these questions is no. This is just an example of what we call spurious correlations.

Spurious Correlations

You can see many, many more examples on a website completely dedicated to spurious correlations; in fact, that’s the title of the website. The cases presented on the spurious correlations site are all examples of what is generally called data dredging, data fishing, or data snooping. It’s basically a form of what in the United States is called cherry picking. An example of data dredging would be looking through many results produced by a random process and picking the one that shows a relationship that supports a theory you want to defend. A Monte Carlo simulation can be used to show how data dredging can result in finding high correlations among variables that are theoretically uncorrelated.

N <- 25
G <- 1000000
sim_data <- tibble(group = rep(1:G, each = N), X = rnorm(N*G), Y = rnorm(N*G))

We’ll save the results of a simulation into a table like this. The first column denotes the group, and we simulate one million groups, each with 25 observations. For each group, we generate 25 pairs of observations, which are stored in the second and third columns. These are just random, independent, normally distributed data, so we know, because we constructed the simulation, that X and Y are not correlated.

res <- sim_data %>% 
  group_by(group) %>% 
  summarize(r = cor(X, Y)) %>% 
  arrange(desc(r))
head(res)
## # A tibble: 6 x 2
##    group     r
##    <int> <dbl>
## 1 159028 0.853
## 2 323141 0.804
## 3 750806 0.790
## 4 388302 0.778
## 5 299200 0.758
## 6   8556 0.751

Next, we compute the correlation between x and y for each group, and look for the maximum. Here are the top correlations.

sim_data %>% filter(group == res$group[which.max(res$r)]) %>%
  ggplot(aes(X, Y)) +
  geom_point() + 
  geom_smooth(method = "lm")

If we just plot the data from this particular group, we get a convincing plot suggesting that X and Y are, in fact, correlated. But remember that the correlation is a random variable.

res %>% ggplot(aes(x=r)) + geom_histogram(binwidth = 0.1, color = "black")

Here’s the distribution we just generated with our Monte Carlo simulation.
It is just a mathematical fact that if we observe random correlations that are expected to be 0, but have a standard error of about 0.2, the largest one will be close to 1 if we pick from among one million. Note that if we performed regression on this group and interpreted the p-value, we would incorrectly claim this was a statistically significant relation.
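The quoted standard error can be checked directly: under independence, the sample correlation of N pairs has a standard error of roughly 1/sqrt(N - 2), which is about 0.21 for N = 25. Here is a minimal sketch in base R (a quick check, not part of the original analysis):

```r
set.seed(1)
# correlation of 25 independent standard normal pairs,
# repeated many times to estimate its standard error
r <- replicate(10000, cor(rnorm(25), rnorm(25)))
sd(r)  # close to 0.2, as stated in the text
```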

sim_data %>% 
  filter(group == res$group[which.max(res$r)]) %>%
  do(tidy(lm(Y ~ X, data = .)))
## # A tibble: 2 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)    0.303    0.105       2.89 0.00835     
## 2 X              0.716    0.0915      7.83 0.0000000621

Here’s the code. Look how small the p-value is. This particular form of data dredging is referred to as p-hacking. P-hacking is a topic of much discussion because it is a problem in scientific publications. Because publishers tend to reward statistically significant results over negative results, there’s an incentive to report significant results. In epidemiology and the social sciences, for example, researchers may look for associations between an adverse outcome and several exposures, and report only the one exposure that resulted in a small p-value. Furthermore, they might try fitting several different models to adjust for confounding and pick the one that yields the smallest p-value. In experimental disciplines, an experiment might be repeated more than once, with only the runs that result in a small p-value reported. This does not necessarily happen due to unethical behavior, but rather to statistical ignorance or wishful thinking. In advanced statistics courses, you’ll learn methods to adjust for what is called the multiple comparison problem.
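As a preview of one such method (the Bonferroni correction, named here only as an illustration): with one million comparisons, keeping the chance of even one false positive at 0.05 requires each individual test to clear

\[ \frac{0.05}{10^6} = 5 \times 10^{-8}, \]

a threshold that even the extreme p-value of 6.21 × 10^-8 above does not meet.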